Abstract

We present a novel linear program for the approximation of the dynamic programming cost-to-go func- 
tion in high-dimensional stochastic control problems. LP approaches to approximate DP have typically 
relied on a natural 'projection' of a well studied linear program for exact dynamic programming. Such 
programs restrict attention to approximations that are lower bounds to the optimal cost-to-go function. 
Our program — the 'smoothed approximate linear program' — is distinct from such approaches and relaxes 
the restriction to lower bounding approximations in an appropriate fashion while remaining computation- 
ally tractable. Doing so appears to have several advantages: First, we demonstrate substantially superior 
bounds on the quality of approximation to the optimal cost-to-go function afforded by our approach. 
Second, experiments with our approach on a challenging problem (the game of Tetris) show that the 
approach outperforms the existing LP approach (which has previously been shown to be competitive with 
several ADP algorithms) by an order of magnitude. 



1. Introduction



Many dynamic optimization problems can be cast as Markov decision problems (MDPs) and solved, in 
principle, via dynamic programming. Unfortunately, this approach is frequently untenable due to the 'curse 
of dimensionality'. Approximate dynamic programming (ADP) is an approach which attempts to address 
this difficulty. ADP algorithms seek to compute good approximations to the dynamic programming optimal
cost-to-go function within the span of some pre-specified set of basis functions. 

ADP algorithms are typically motivated by exact algorithms for dynamic programming. The approximate
linear programming (ALP) method is one such approach, motivated by the LP used for the computation of
the optimal cost-to-go function. Introduced by Schweitzer and Seidmann (1985) and analyzed and further
developed by de Farias and Van Roy (2003, 2004), this approach is attractive for a number of reasons. First,
the availability of efficient solvers for linear programming makes the LP approach easy to implement. Second,
the approach offers attractive theoretical guarantees. In particular, the quality of the approximation to the 
cost-to-go function produced by the LP approach can be shown to compete, in an appropriate sense, with 
the quality of the best possible approximation afforded by the set of basis functions used. A testament 
to the success of the LP approach is the number of applications it has seen in recent years in large scale 
dynamic optimization problems. These applications range from the control of queueing networks to revenue 
management to the solution of large scale stochastic games. 

The optimization program employed in the ALP approach is in some sense the most natural linear
programming formulation for ADP. In particular, the ALP is identical to the linear program used for exact
computation of the optimal cost-to-go function, with further constraints limiting solutions to the low-dimensional
subspace spanned by the basis functions used. The resulting LP implicitly restricts attention to approxima- 
tions that are lower bounds to the optimal cost-to-go function. The structure of this program appears crucial 
in establishing guarantees on the quality of approximations produced by the approach; these approximation 
guarantees were remarkable and a first for any ADP method. That said, the restriction to lower bounds 
naturally leads one to ask whether the program employed by the ALP approach is the 'right' math program- 
ming formulation for ADP. In particular, it may be advantageous to relax the lower bound requirement so 
as to allow for a better approximation, and, ultimately, better policy performance. Is there an alternative 
formulation that permits better approximations to the cost-to-go function while remaining computationally 
tractable? Motivated by this question, the present paper introduces a new linear program for ADP we call the 
'smoothed' approximate linear program (or SALP). We believe that the SALP provides a preferable math 
programming formulation for ADP. In particular, we make the following contributions: 

1. We are able to establish strong approximation and performance guarantees for approximations to 
the cost-to-go function produced by the SALP; these guarantees are substantially stronger than the 
corresponding guarantees for the ALP. 

2. The number of constraints and variables in the SALP scales with the size of the MDP state space.
We nonetheless establish sample complexity bounds that demonstrate that an appropriate 'sampled' 
SALP provides a good approximation to the SALP solution with a tractable number of sampled MDP 
states. Moreover, we identify structural properties for the sampled SALP that can be exploited for fast 
optimization. Our sample complexity results and these structural observations allow us to conclude 
that the SALP is essentially no harder to solve than existing LP formulations for ADP. 

3. We present a computational study demonstrating the efficacy of our approach on the game of Tetris. 
Tetris is a notoriously difficult, 'unstructured' dynamic optimization problem and has been used as 
a convenient testbed problem for numerous ADP approaches. The ALP has been demonstrated to 
be competitive with other ADP approaches for Tetris, such as temporal difference learning or policy
gradient methods (see Farias and Van Roy, 2006). In detailed comparisons with the ALP, we show that
the SALP provides an order of magnitude improvement over controllers designed via that approach
for the game of Tetris.

The literature on ADP algorithms is vast and we make no attempt to survey it here. Van Roy (2002) or
Bertsekas (2007, Chap. 6) provide good, brief overviews, while Bertsekas and Tsitsiklis (1996) and
Powell (2007) are encyclopedic references on the topic. The exact LP for the solution of dynamic programs is
attributed to Manne (1960). The ALP approach to ADP was introduced by Schweitzer and Seidmann (1985)
and de Farias and Van Roy (2003, 2004). de Farias and Van Roy (2003) establish strong approximation
guarantees for ALP based approximations assuming knowledge of a 'Lyapunov'-like function which must be
included in the basis. The approach we present may be viewed as optimizing over all possible Lyapunov
functions. de Farias and Van Roy (2006) introduce a program for average cost approximate dynamic
programming that resembles the SALP; a critical difference is that their program requires the relative violation
allowed across ALP constraints be specified as input. Applications of the LP approach to ADP range from
scheduling in queueing networks (Morrison and Kumar, 1999; Veatch, 2005; Moallemi et al., 2008), revenue
management (Adelman, 2007; Farias and Van Roy, 2007; Zhang and Adelman, 2008), portfolio management
(Han, 2005), inventory problems (Adelman, 2004; Adelman and Klabjan, 2009), and algorithms for solving
stochastic games (Farias et al., 2008) among others. Remarkably, in applications such as network revenue
management, control policies produced via the LP approach (namely, Adelman, 2007; Farias and Van Roy,
2007) are competitive with ADP approaches that carefully exploit problem structure, such as for instance
that of Topaloglu (2009).



The remainder of this paper is organized as follows: In Section 2, we formulate the approximate dynamic
programming setting and describe the ALP approach. The smoothed ALP is developed as a relaxation of
the ALP in Section 3. Section 4 provides a theoretical analysis of the SALP, in terms of approximation
and performance guarantees, as well as a sample complexity bound. In Section 5, we describe the practical
implementation of the SALP method, illustrating how parameter choices can be made as well as how to
efficiently solve the resulting optimization program. Section 6 contains the computational study of the game
Tetris. Finally, in Section 7, we conclude.



2. Problem Formulation 



Our setting is that of a discrete-time, discounted infinite-horizon, cost-minimizing MDP with a finite state
space $\mathcal{X}$ and finite action space $\mathcal{A}$. At time $t$, given the current state $x_t$ and a choice of action $a_t$, a per-stage
cost $g(x_t, a_t)$ is incurred. The subsequent state $x_{t+1}$ is determined according to the transition probability
kernel $P_{a_t}(x_t, \cdot)$.

A stationary policy $\mu : \mathcal{X} \to \mathcal{A}$ is a mapping that determines the choice of action at each time as a
function of the state. Given each initial state $x_0 = x$, the expected discounted cost (cost-to-go function) of
the policy $\mu$ is given by

\[
J_\mu(x) \triangleq \mathsf{E}_\mu\left[ \sum_{t=0}^{\infty} \alpha^t g(x_t, \mu(x_t)) \,\middle|\, x_0 = x \right].
\]

Here, $\alpha \in (0, 1)$ is the discount factor. The expectation is taken under the assumption that actions are selected
according to the policy $\mu$. In other words, at each time $t$, $a_t = \mu(x_t)$.

Denote by $P_\mu \in \mathbb{R}^{\mathcal{X} \times \mathcal{X}}$ the transition probability matrix for the policy $\mu$, whose $(x, x')$th entry is
$P_{\mu(x)}(x, x')$. Denote by $g_\mu \in \mathbb{R}^{\mathcal{X}}$ the vector whose $x$th entry is $g(x, \mu(x))$. Then, the cost-to-go function
$J_\mu$ can be written in vector form as

\[
J_\mu = \sum_{t=0}^{\infty} \alpha^t P_\mu^t g_\mu.
\]

Further, the cost-to-go function $J_\mu$ is the unique solution to the equation $T_\mu J = J$, where the operator $T_\mu$ is
defined by $T_\mu J = g_\mu + \alpha P_\mu J$.
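
The fixed-point characterization above can be illustrated numerically. The following is a minimal sketch, assuming a hypothetical two-state, two-action MDP (the arrays `P`, `g` and the policy `mu` are illustrative and not taken from the paper); it evaluates a policy by repeatedly applying $T_\mu$ and then checks that $T_\mu J_\mu = J_\mu$.

```python
def policy_operator(J, P, g, mu, alpha):
    """Apply T_mu: (T_mu J)(x) = g(x, mu(x)) + alpha * sum_y P_{mu(x)}(x, y) J(y)."""
    n = len(J)
    return [
        g[x][mu[x]] + alpha * sum(P[mu[x]][x][y] * J[y] for y in range(n))
        for x in range(n)
    ]


def evaluate_policy(P, g, mu, alpha, iters=2000):
    """Approximate J_mu by iterating the contraction T_mu from J = 0."""
    J = [0.0] * len(mu)
    for _ in range(iters):
        J = policy_operator(J, P, g, mu, alpha)
    return J


# Hypothetical MDP: P[a][x][y] is the transition probability, g[x][a] the cost.
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.6, 0.4]]]
g = [[1.0, 2.0], [3.0, 0.5]]
alpha = 0.9
mu = [0, 1]  # action 0 in state 0, action 1 in state 1

J_mu = evaluate_policy(P, g, mu, alpha)
residual = max(
    abs(a - b) for a, b in zip(policy_operator(J_mu, P, g, mu, alpha), J_mu)
)
print(J_mu, residual)
```

Iterating $T_\mu$ is chosen here only to avoid a linear-algebra dependency; solving $(I - \alpha P_\mu) J = g_\mu$ directly gives the same vector.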

Our goal is to find an optimal stationary policy $\mu^*$, that is, a policy that minimizes the expected discounted
cost from every state $x$. In particular,

\[
\mu^* \in \operatorname*{argmin}_{\mu} J_\mu(x).
\]

The Bellman operator $T$ is defined component-wise according to

\[
(TJ)(x) \triangleq \min_{a \in \mathcal{A}} \; g(x, a) + \alpha \sum_{x' \in \mathcal{X}} P_a(x, x') J(x'), \quad \forall\, x \in \mathcal{X}.
\]

Bellman's equation is then the fixed point equation

\[
TJ = J. \tag{1}
\]

Standard results in dynamic programming establish that the optimal cost-to-go function $J^*$ is the unique
solution to Bellman's equation (see, for example, Bertsekas, 2007, Chap. 1). Further, if $\mu^*$ is a policy that is
greedy with respect to $J^*$ (i.e., $\mu^*$ satisfies $TJ^* = T_{\mu^*} J^*$), then $\mu^*$ is an optimal policy.
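
As an illustration of the Bellman operator and the greedy policy, the sketch below runs value iteration on a hypothetical two-state, two-action MDP (the data `P`, `g` and the function names are assumptions made for this example, not part of the paper) and then reads off a greedy policy from the resulting fixed point.

```python
def q_value(J, P, g, alpha, x, a):
    """g(x, a) + alpha * sum_y P_a(x, y) J(y)."""
    return g[x][a] + alpha * sum(P[a][x][y] * J[y] for y in range(len(J)))


def bellman(J, P, g, alpha):
    """Apply T: (TJ)(x) = min_a q_value(J, x, a)."""
    return [
        min(q_value(J, P, g, alpha, x, a) for a in range(len(P)))
        for x in range(len(J))
    ]


def greedy_policy(J, P, g, alpha):
    """An action attaining the minimum in (TJ)(x), for every state x."""
    return [
        min(range(len(P)), key=lambda a: q_value(J, P, g, alpha, x, a))
        for x in range(len(J))
    ]


def value_iteration(P, g, alpha, iters=2000):
    """Approximate the fixed point of T by repeated application from J = 0."""
    J = [0.0] * len(g)
    for _ in range(iters):
        J = bellman(J, P, g, alpha)
    return J


# Hypothetical MDP: P[a][x][y] is the transition probability, g[x][a] the cost.
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.6, 0.4]]]
g = [[1.0, 2.0], [3.0, 0.5]]
alpha = 0.9

J_star = value_iteration(P, g, alpha)
mu_star = greedy_policy(J_star, P, g, alpha)
bellman_residual = max(
    abs(a - b) for a, b in zip(bellman(J_star, P, g, alpha), J_star)
)
print(J_star, mu_star, bellman_residual)
```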



2.1. The Linear Programming Approach 

A number of computational approaches are available for the solution of the Bellman equation. One approach
involves solving the optimization program:

\[
\begin{array}{ll}
\underset{J}{\text{maximize}} & \nu^\top J \\
\text{subject to} & J \le TJ.
\end{array}
\tag{2}
\]

Here, $\nu \in \mathbb{R}^{\mathcal{X}}$ is a vector with positive components that are known as the state-relevance weights. The above
program is indeed an LP since for each state $x$, the constraint $J(x) \le (TJ)(x)$ is equivalent to the set of $|\mathcal{A}|$
linear constraints

\[
J(x) \le g(x, a) + \alpha \sum_{x' \in \mathcal{X}} P_a(x, x') J(x'), \quad \forall\, a \in \mathcal{A}.
\]

We refer to (2) as the exact LP.

Suppose that a vector $J$ is feasible for the exact LP (2). Since $J \le TJ$, monotonicity of the Bellman
operator implies that $J \le T^k J$, for any integer $k \ge 1$. Since the Bellman operator $T$ is a contraction, $T^k J$
must converge to the unique fixed point $J^*$ as $k \to \infty$. Thus, we have that $J \le J^*$. Then, it is clear that
every feasible point for (2) is a component-wise lower bound to $J^*$. Since $J^*$ itself is feasible for (2), it must
be that $J^*$ is the unique optimal solution to the exact LP.
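
The lower-bound argument can be traced numerically. The sketch below is an illustration under assumed data (a hypothetical two-state MDP with non-negative costs, so that $J = 0$ satisfies $J \le TJ$): it checks feasibility of $J$, then verifies that the iterates $T^k J$ increase monotonically towards their limit, which therefore dominates $J$.

```python
def bellman(J, P, g, alpha):
    """(TJ)(x) = min_a [ g(x, a) + alpha * sum_y P_a(x, y) J(y) ]."""
    n = len(J)
    return [
        min(g[x][a] + alpha * sum(P[a][x][y] * J[y] for y in range(n))
            for a in range(len(P)))
        for x in range(n)
    ]


# Hypothetical MDP: P[a][x][y] is the transition probability, g[x][a] the cost.
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.6, 0.4]]]
g = [[1.0, 2.0], [3.0, 0.5]]
alpha = 0.9

J = [0.0, 0.0]  # candidate point; costs are non-negative, so 0 <= T0
feasible = all(j <= tj for j, tj in zip(J, bellman(J, P, g, alpha)))

iterates = [J]
for _ in range(2000):
    iterates.append(bellman(iterates[-1], P, g, alpha))
J_limit = iterates[-1]  # numerically indistinguishable from J*

# J <= TJ <= T^2 J <= ... (small tolerance for floating-point rounding)
monotone = all(
    a <= b + 1e-12
    for prev, nxt in zip(iterates, iterates[1:])
    for a, b in zip(prev, nxt)
)
lower_bound = all(j <= jl + 1e-12 for j, jl in zip(J, J_limit))
print(feasible, monotone, lower_bound, J_limit)
```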

2.2. The Approximate Linear Program 

In many problems, the size of the state space is enormous due to the curse of dimensionality. In such cases, it 
may be prohibitive to store, much less compute, the optimal cost-to-go function J* . In approximate dynamic 
programming (ADP), the goal is to find tractable approximations to the optimal cost-to-go function J*, with 
the hope that they will lead to good policies. 

Specifically, consider a collection of basis functions $\{\phi_1, \ldots, \phi_K\}$ where each $\phi_i : \mathcal{X} \to \mathbb{R}$ is a
real-valued function on the state space. ADP algorithms seek to find linear combinations of the basis functions that
provide good approximations to the optimal cost-to-go function. In particular, we seek a vector of weights
$r \in \mathbb{R}^K$ so that

\[
J^*(x) \approx J_r(x) \triangleq \sum_{i=1}^{K} \phi_i(x) r_i = \Phi r(x).
\]

Here, we define $\Phi \triangleq [\phi_1 \ \phi_2 \ \cdots \ \phi_K]$ to be a matrix with columns consisting of the basis functions. Given a
vector of weights $r$ and the corresponding value function approximation $\Phi r$, a policy $\mu_r$ is naturally defined
as the 'greedy' policy with respect to $\Phi r$, i.e., as the policy satisfying $T_{\mu_r} \Phi r = T \Phi r$.

One way to obtain a set of weights is to solve the exact LP (2), but restricting to the low-dimensional
subspace of vectors spanned by the basis functions. This leads to the approximate linear program (ALP),
which is defined by

\[
\begin{array}{ll}
\underset{r}{\text{maximize}} & \nu^\top \Phi r \\
\text{subject to} & \Phi r \le T \Phi r.
\end{array}
\tag{3}
\]

For the balance of the paper, we will make the following assumption:

Assumption 1. Assume that $\nu$ is a probability distribution ($\nu \ge 0$, $\mathbf{1}^\top \nu = 1$), and that the constant function
$\mathbf{1}$ is in the span of the basis functions $\Phi$.

The geometric intuition behind the ALP is illustrated in Figure 1(a). Suppose that $r_{\text{ALP}}$ is a vector that
is optimal for the ALP. Then the approximate value function $\Phi r_{\text{ALP}}$ will lie on the subspace spanned by
the columns of $\Phi$, as illustrated by the orange line. $\Phi r_{\text{ALP}}$ will also satisfy the constraints of the exact LP,
illustrated by the dark gray region. By the discussion in Section 2.1, this implies that $\Phi r_{\text{ALP}} \le J^*$. In other
words, the approximate cost-to-go function is necessarily a point-wise lower bound to the true cost-to-go
function in the span of $\Phi$.

One can thus interpret the ALP solution $r_{\text{ALP}}$ equivalently as the optimal solution to the program

\[
\begin{array}{ll}
\underset{r}{\text{minimize}} & \|J^* - \Phi r\|_{1,\nu} \\
\text{subject to} & \Phi r \le T \Phi r.
\end{array}
\tag{4}
\]

Here, the weighted 1-norm in the objective is defined by

\[
\|J^* - \Phi r\|_{1,\nu} \triangleq \sum_{x \in \mathcal{X}} \nu(x) \, |J^*(x) - \Phi r(x)|.
\]

This implies that the approximate LP will find the closest approximation (in the appropriate norm) to the
optimal cost-to-go function, out of all approximations satisfying the constraints of the exact LP.
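
The equivalence between the ALP objective and the weighted 1-norm objective rests on a one-line computation, spelled out here for convenience. It uses only the lower-bound property of ALP-feasible points and the non-negativity of the state-relevance weights.

```latex
% For any r that is feasible for the ALP, \Phi r \le J^* component-wise, so
% every term J^*(x) - \Phi r(x) is non-negative and the absolute values drop:
\begin{align*}
\|J^* - \Phi r\|_{1,\nu}
  &= \sum_{x \in \mathcal{X}} \nu(x) \bigl( J^*(x) - \Phi r(x) \bigr) \\
  &= \nu^\top J^* - \nu^\top \Phi r .
\end{align*}
% The term \nu^\top J^* does not depend on r. Hence maximizing \nu^\top \Phi r
% over the ALP feasible set is the same as minimizing \|J^* - \Phi r\|_{1,\nu}
% over that set.
```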



[Figure 1, panels (a) ALP case and (b) SALP case. Recoverable labels: $J(2)$, $J = \Phi r$.]

Figure 1: A cartoon illustrating the feasible set and optimal solution for the ALP and SALP, in the case of a
two-state MDP. The axes correspond to the components of the value function. A careful relaxation from the
feasible set of the ALP to that of the SALP can yield an improved approximation. It is easy to construct a concrete
two state example with the above features.



3. The Smoothed ALP 

The J < TJ constraints in the exact LP, which carry over to the ALP, impose a strong restriction on the 
cost-to-go function approximation: in particular they restrict us to approximations that are lower bounds to 
J* at every point in the state space. In the case where the state space is very large, and the number of basis 
functions is (relatively) small, it may be the case that constraints arising from rarely visited or pathological 
states are binding and influence the optimal solution. 

In many cases, the ultimate goal is not to find a lower bound on the optimal cost-to-go function, but rather
to find a good approximation. In these instances, relaxing the constraints in the ALP, so as
not to require a uniform lower bound, may allow for better overall approximations to the optimal cost-to-go
function. This is also illustrated in Figure 1. Relaxing the feasible region of the ALP in Figure 1(a) to the
light gray region in Figure 1(b) would yield the point $\Phi r_{\text{SALP}}$ as an optimal solution. The relaxation in this
case is clearly beneficial; it allows us to compute a better approximation to $J^*$ than the point $\Phi r_{\text{ALP}}$.

Can we construct a fruitful relaxation of this sort in general? The smoothed approximate linear program
(SALP) is given by:

\[
\begin{array}{ll}
\underset{r,s}{\text{maximize}} & \nu^\top \Phi r \\
\text{subject to} & \Phi r \le T \Phi r + s, \\
& \pi^\top s \le \theta, \quad s \ge 0.
\end{array}
\tag{5}
\]

Here, a vector $s \in \mathbb{R}^{\mathcal{X}}$ of additional decision variables has been introduced. For each state $x$, $s(x)$ is a
non-negative decision variable (a slack) that allows for violation of the corresponding ALP constraint. The
parameter $\theta \ge 0$ is a non-negative scalar. The parameter $\pi \in \mathbb{R}^{\mathcal{X}}$ is a probability distribution known as the
constraint violation distribution. The parameter $\theta$ is thus a violation budget: the expected violation of the
$\Phi r \le T \Phi r$ constraint, under the distribution $\pi$, must be at most $\theta$.

The SALP can be alternatively written as

\[
\begin{array}{ll}
\underset{r}{\text{maximize}} & \nu^\top \Phi r \\
\text{subject to} & \pi^\top (\Phi r - T \Phi r)^+ \le \theta.
\end{array}
\tag{6}
\]

Here, given a vector $J$, $J^+(x) \triangleq \max(J(x), 0)$ is defined to be the component-wise positive part. Note that,
when $\theta = 0$, the SALP is equivalent to the ALP. When $\theta > 0$, the SALP replaces the 'hard' constraints of
the ALP with 'soft' constraints in the form of a hinge-loss function.
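
To make the hinge-loss form concrete, the sketch below solves a deliberately tiny instance: a hypothetical two-state MDP with a single constant basis function, so that $\Phi r = r \mathbf{1}$ with scalar $r$. In this special case the violation $\pi^\top (\Phi r - T \Phi r)^+$ is non-decreasing in $r$ and the objective equals $r$, so the optimum can be found by bisection rather than by an LP solver. The MDP data, the weights `nu`, the distribution `pi` and the budget 0.02 are all assumptions made for illustration.

```python
def bellman(J, P, g, alpha):
    """(TJ)(x) = min_a [ g(x, a) + alpha * sum_y P_a(x, y) J(y) ]."""
    n = len(J)
    return [
        min(g[x][a] + alpha * sum(P[a][x][y] * J[y] for y in range(n))
            for a in range(len(P)))
        for x in range(n)
    ]


def hinge_violation(r, pi, P, g, alpha):
    """pi^T (Phi r - T Phi r)^+ when Phi is the single constant basis function."""
    J = [r] * len(pi)
    TJ = bellman(J, P, g, alpha)
    return sum(p * max(j - tj, 0.0) for p, j, tj in zip(pi, J, TJ))


def solve_scalar_salp(theta, pi, P, g, alpha, lo=0.0, hi=1e3, iters=200):
    """Largest r in [lo, hi] with hinge_violation(r) <= theta, by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if hinge_violation(mid, pi, P, g, alpha) <= theta:
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical MDP: P[a][x][y] is the transition probability, g[x][a] the cost.
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.6, 0.4]]]
g = [[1.0, 2.0], [3.0, 0.5]]
alpha = 0.9
nu = [0.5, 0.5]  # state-relevance weights (assumed)
pi = [0.9, 0.1]  # constraint violation distribution (assumed)

J_star = [0.0, 0.0]
for _ in range(2000):  # value iteration, used only as a reference for J*
    J_star = bellman(J_star, P, g, alpha)


def error(r):
    """Weighted 1-norm ||J* - Phi r||_{1,nu}."""
    return sum(w * abs(js - r) for w, js in zip(nu, J_star))


r_alp = solve_scalar_salp(0.0, pi, P, g, alpha)    # theta = 0 recovers the ALP
r_salp = solve_scalar_salp(0.02, pi, P, g, alpha)  # small violation budget
print(r_alp, r_salp, error(r_alp), error(r_salp))
```

In this instance the ALP is pinned down by the single state with the smallest per-stage cost; allowing a small expected violation at that state moves the approximation closer to $J^*$ in the weighted 1-norm. This is one constructed example, not evidence about the general case.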

The balance of the paper is concerned with establishing that the SALP forms the basis of a useful 
approximate dynamic programming algorithm in large scale problems: 

• We identify a concrete choice of violation budget $\theta$ and an idealized constraint violation distribution
$\pi$ for which the SALP provides a useful relaxation in that the optimal solution can be a better
approximation to the optimal cost-to-go function. This brings the cartoon improvement in Figure 1 to fruition
for general problems.

• We show that the SALP is tractable (i.e., it is well approximated by an appropriate 'sampled' version) 
and present computational experiments for a hard problem (Tetris) illustrating an order of magnitude 
improvement over the ALP. 

4. Analysis 

This section is dedicated to a theoretical analysis of the SALP. The overarching objective of this analysis is 
to provide some assurance of the soundness of the proposed approach. In some instances, the bounds we 
provide will be directly comparable to bounds that have been developed for the ALP method. As such, a 
relative consideration of the bounds in these two cases can provide a theoretical comparison between the ALP 
and SALP methods. In addition, our analysis will serve as a crucial guide to practical implementation of the 
SALP as will be described in Section 5. In particular, the theoretical analysis presented here provides intuition
as to how to select parameters such as the state-relevance weights and the constraint violation distribution.
We note that all of our bounds are relative to a measure of how well the approximation architecture employed
is capable of approximating the optimal cost-to-go function; it is unreasonable to expect non-trivial bounds
that are independent of the architecture used.

Our analysis will present three types of results:

• Approximation guarantees (Sections 4.2 and 4.3): We establish bounds on the distance between
approximations computed by the SALP and the optimal value function $J^*$, relative to the distance between
the best possible approximation afforded by the chosen basis functions and $J^*$. These guarantees will
indicate that the SALP computes approximations that are of comparable quality to the projection¹ of
$J^*$ on to the linear span of $\Phi$.



• Performance bounds (Section 4.4): While it is desirable to approximate $J^*$ as closely as possible,
an important concern is the quality of the policies generated by acting greedily according to such 
approximations, as measured by their performance. We present bounds on the performance loss 
incurred, relative to the optimal policy, in using an SALP approximation. 

• Sample complexity results (Section 4.5): The SALP is a linear program with a large number of
constraints as well as variables. In practical implementations, one may consider a 'sampled' version of
this program that has a manageable number of variables and constraints. We present sample complexity
guarantees that establish bounds on the number of samples required to produce a good approximation
to the solution of the SALP. These bounds scale linearly with the number of basis functions $K$ and are
independent of the size of the state space $\mathcal{X}$.

4.1. Idealized Assumptions 

Given the broad scope of problems addressed by ADP algorithms, analyses of such algorithms typically rely 
on an 'idealized' assumption of some sort. In the case of the ALP, one either assumes the ability to solve a 
linear program with as many constraints as there are states, or, absent that, knowledge of a certain idealized 
sampling distribution, so that one can then proceed with solving a 'sampled' version of the ALP. Our analysis 
of the SALP in this section is predicated on the knowledge of this same idealized sampling distribution. In 
particular, letting $\mu^*$ be an optimal policy and $P_{\mu^*}$ the associated transition matrix, we will require access to
samples drawn according to the distribution $\pi_{\mu^*,\nu}$ given by

\[
\pi_{\mu^*,\nu}^\top \triangleq (1 - \alpha) \nu^\top (I - \alpha P_{\mu^*})^{-1} = (1 - \alpha) \sum_{t=0}^{\infty} \alpha^t \nu^\top P_{\mu^*}^t.
\tag{7}
\]

Here $\nu$ is an arbitrary initial distribution over states. The distribution $\pi_{\mu^*,\nu}$ may be interpreted as yielding the
discounted expected frequency of visits to a given state when the initial state is distributed according to $\nu$ and
the system runs under the policy $\mu^*$. We note that the 'sampled' ALP introduced by de Farias and Van Roy
(2004) requires access to states sampled according to precisely this distribution. Theoretical analyses of
other approaches to approximate DP such as approximate value iteration and temporal difference learning
similarly rely on the knowledge of specialized sampling distributions that cannot be obtained tractably (see
de Farias and Van Roy, 2000).

¹ Note that it is intractable to directly compute the projection since $J^*$ is unknown.
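
For a small chain, the discounted visit distribution in (7) can be computed directly from its series form. The sketch below does so for a hypothetical two-state transition matrix `P_mu` and initial distribution `nu` (both assumed for illustration), and checks the equivalent fixed-point identity $\pi^\top = (1 - \alpha)\nu^\top + \alpha \pi^\top P_\mu$ obtained by multiplying (7) on the right by $(I - \alpha P_\mu)$.

```python
def discounted_visit_distribution(nu, P_mu, alpha, iters=2000):
    """Truncated series (1 - alpha) * sum_t alpha^t nu^T P_mu^t."""
    n = len(nu)
    row = list(nu)          # holds nu^T P_mu^t, starting at t = 0
    pi = [0.0] * n
    weight = 1.0 - alpha    # holds (1 - alpha) * alpha^t
    for _ in range(iters):
        pi = [p + weight * v for p, v in zip(pi, row)]
        row = [sum(row[x] * P_mu[x][y] for x in range(n)) for y in range(n)]
        weight *= alpha
    return pi


nu = [0.5, 0.5]                  # initial distribution (assumed)
P_mu = [[0.9, 0.1], [0.6, 0.4]]  # transition matrix of a fixed policy (assumed)
alpha = 0.9

pi = discounted_visit_distribution(nu, P_mu, alpha)

# Fixed-point form of the definition: pi^T = (1 - alpha) nu^T + alpha pi^T P_mu.
rhs = [
    (1 - alpha) * nu[y] + alpha * sum(pi[x] * P_mu[x][y] for x in range(2))
    for y in range(2)
]
fixed_point_gap = max(abs(a - b) for a, b in zip(pi, rhs))
print(pi, sum(pi), fixed_point_gap)
```

In a large problem this computation is exactly what is unavailable, since it requires the optimal policy and a pass over the state space; the sketch only illustrates the definition.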



4.2. A Simple Approximation Guarantee 

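This section presents a first, simple approximation guarantee for the following specialization of the SALP in
(5),

\[
\begin{array}{ll}
\underset{r,s}{\text{maximize}} & \nu^\top \Phi r \\
\text{subject to} & \Phi r \le T \Phi r + s, \\
& \pi_{\mu^*,\nu}^\top s \le \theta, \quad s \ge 0.
\end{array}
\tag{8}
\]

Here, the constraint violation distribution is set to be $\pi_{\mu^*,\nu}$.

Before we state our approximation guarantee, consider the following function:

\[
\begin{array}{lll}
\ell(r, \theta) \triangleq & \underset{s,\gamma}{\text{minimize}} & \gamma / (1 - \alpha) \\
& \text{subject to} & \Phi r - T \Phi r \le s + \gamma \mathbf{1}, \\
& & \pi_{\mu^*,\nu}^\top s \le \theta, \quad s \ge 0.
\end{array}
\tag{9}
\]

Suppose we are given a vector $r$ of basis function weights and a violation budget $\theta$. As we will shortly
demonstrate, $\ell(r, \theta)$ defines the minimum translation (in the direction of the vector $\mathbf{1}$) of $r$ so as to get
a feasible solution for (8). We will denote by $s(r, \theta)$ the $s$ component of the solution to (9). The following
lemma, whose proof may be found in Appendix A, characterizes the function $\ell(r, \theta)$:

Lemma 1. For any $r \in \mathbb{R}^K$ and $\theta \ge 0$:

(i) $\ell(r, \theta)$ is a finite-valued, decreasing, piecewise linear, convex function of $\theta$.

(ii)
\[
\ell(r, \theta) \le \frac{1 + \alpha}{1 - \alpha} \|J^* - \Phi r\|_\infty.
\]

(iii) The right partial derivative of $\ell(r, \theta)$ with respect to $\theta$ satisfies

\[
\frac{\partial^+}{\partial \theta^+} \ell(r, 0) = - \left[ (1 - \alpha) \sum_{x \in \Omega(r)} \pi_{\mu^*,\nu}(x) \right]^{-1},
\]

where

\[
\Omega(r) \triangleq \operatorname*{argmax}_{x \in \mathcal{X}} \; \Phi r(x) - T \Phi r(x).
\]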

Armed with this definition, we are now in a position to state our first, crude approximation guarantee:

Theorem 1. Suppose that $r_{\text{SALP}}$ is an optimal solution to the SALP (8), and let $r^*$ satisfy

\[
r^* \in \operatorname*{argmin}_{r} \|J^* - \Phi r\|_\infty.
\]

Then,

\[
\|J^* - \Phi r_{\text{SALP}}\|_{1,\nu} \le \|J^* - \Phi r^*\|_\infty + \ell(r^*, \theta) + \frac{2\theta}{1 - \alpha}.
\tag{10}
\]

The above theorem allows us to interpret $\ell(r^*, \theta) + 2\theta/(1 - \alpha)$ as an upper bound to the approximation error
(in the $\|\cdot\|_{1,\nu}$ norm) associated with the SALP solution $r_{\text{SALP}}$, relative to the error of the best approximation
$r^*$ (in the $\|\cdot\|_\infty$ norm). This theorem also provides justification for the intuition, described in Section 3,
that a relaxation of the feasible region of the ALP will result in better value function approximations. To see
this, consider the following corollary:

Corollary 1. Define $U_{\text{SALP}}(\theta)$ to be the upper bound in (10), i.e.,

\[
U_{\text{SALP}}(\theta) \triangleq \|J^* - \Phi r^*\|_\infty + \ell(r^*, \theta) + \frac{2\theta}{1 - \alpha}.
\]

Then:

(i)
\[
U_{\text{SALP}}(0) \le \frac{2}{1 - \alpha} \|J^* - \Phi r^*\|_\infty.
\]

(ii) The right partial derivative of $U_{\text{SALP}}(\theta)$ with respect to $\theta$ satisfies

\[
\frac{d^+}{d\theta^+} U_{\text{SALP}}(0) = \frac{1}{1 - \alpha} \left[ 2 - \left( \sum_{x \in \Omega(r^*)} \pi_{\mu^*,\nu}(x) \right)^{-1} \right].
\]

Proof. The result follows immediately from Parts (ii) and (iii) of Lemma 1. ■

Suppose that $\theta = 0$, in which case the SALP (8) is identical to the ALP (3); thus, $r_{\text{SALP}} = r_{\text{ALP}}$. Applying
Part (i) of Corollary 1, we have, for the ALP, the approximation error bound

\[
\|J^* - \Phi r_{\text{ALP}}\|_{1,\nu} \le \frac{2}{1 - \alpha} \|J^* - \Phi r^*\|_\infty.
\tag{11}
\]

This is precisely Theorem 2 of de Farias and Van Roy (2003); we recover their approximation guarantee for
the ALP.

Now observe that, from Part (ii) of Corollary 1, if the set $\Omega(r^*)$ is of very small probability according to
the distribution $\pi_{\mu^*,\nu}$, we expect that the upper bound $U_{\text{SALP}}(\theta)$ will decrease dramatically as $\theta$ is increased
from 0.² In other words, if the Bellman error $\Phi r^*(x) - T \Phi r^*(x)$ produced by $r^*$ is maximized at states $x$
that are collectively of very small probability, then we expect to have a choice of $\theta > 0$ for which

\[
U_{\text{SALP}}(\theta) \ll U_{\text{SALP}}(0) \le \frac{2}{1 - \alpha} \|J^* - \Phi r^*\|_\infty.
\]

In this case, the bound (10) on the SALP solution will be an improvement over the bound (11) on the ALP
solution.

² Already if $\pi_{\mu^*,\nu}(\Omega(r^*)) < 1/2$, $\frac{d^+}{d\theta^+} U_{\text{SALP}}(0) < 0$.

Before we present the proof of Theorem 1, we present an auxiliary claim that we will have several
opportunities to use. The proof can be found in Appendix A.

Lemma 2. Suppose that the vectors $J \in \mathbb{R}^{\mathcal{X}}$ and $s \in \mathbb{R}^{\mathcal{X}}$ satisfy

\[
J \le T_{\mu^*} J + s.
\]

Then,

\[
J \le J^* + \Delta^* s,
\]

where

\[
\Delta^* \triangleq \sum_{k=0}^{\infty} (\alpha P_{\mu^*})^k = (I - \alpha P_{\mu^*})^{-1},
\]

and $P_{\mu^*}$ is the transition probability matrix corresponding to an optimal policy.
In particular, if $(r, s)$ is feasible for the LP (8), then

\[
\Phi r \le J^* + \Delta^* s.
\]

A feasible solution to the ALP is necessarily a lower bound to the optimal cost-to-go function, $J^*$. This
is no longer the case for the SALP; the above lemma characterizes the extent to which this restriction is
relaxed.
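
The relaxed bound of Lemma 2 can be checked on a small example. The sketch below is an illustration under assumed data: it takes a hypothetical two-state MDP and an arbitrary vector `J` that is not a lower bound on $J^*$, sets the slack to $s = (J - TJ)^+$ (so that $J \le TJ + s \le T_{\mu^*} J + s$), and verifies $J \le J^* + \Delta^* s$ component-wise. The product $\Delta^* s$ is computed as the fixed point of $y = s + \alpha P_{\mu^*} y$.

```python
def q_value(J, P, g, alpha, x, a):
    """g(x, a) + alpha * sum_y P_a(x, y) J(y)."""
    return g[x][a] + alpha * sum(P[a][x][y] * J[y] for y in range(len(J)))


def bellman(J, P, g, alpha):
    """(TJ)(x) = min_a q_value(J, x, a)."""
    return [
        min(q_value(J, P, g, alpha, x, a) for a in range(len(P)))
        for x in range(len(J))
    ]


def apply_delta_star(s, P_mu, alpha, iters=2000):
    """Delta* s = sum_k (alpha P_mu)^k s, via the fixed point y = s + alpha P_mu y."""
    n = len(s)
    y = [0.0] * n
    for _ in range(iters):
        y = [s[x] + alpha * sum(P_mu[x][z] * y[z] for z in range(n)) for x in range(n)]
    return y


# Hypothetical MDP: P[a][x][y] is the transition probability, g[x][a] the cost.
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.6, 0.4]]]
g = [[1.0, 2.0], [3.0, 0.5]]
alpha = 0.9

J_star = [0.0, 0.0]
for _ in range(2000):  # value iteration for the reference J*
    J_star = bellman(J_star, P, g, alpha)
mu_star = [
    min(range(len(P)), key=lambda a: q_value(J_star, P, g, alpha, x, a))
    for x in range(2)
]
P_mu_star = [P[mu_star[x]][x] for x in range(2)]

J = [8.0, 12.0]  # arbitrary test vector; its second entry exceeds J*
s = [max(j - tj, 0.0) for j, tj in zip(J, bellman(J, P, g, alpha))]
bound = [js + d for js, d in zip(J_star, apply_delta_star(s, P_mu_star, alpha))]
print(J, J_star, bound)
```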

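We now proceed with the proof of Theorem 1.

Proof of Theorem 1. First, define the weight vector $\tilde{r} \in \mathbb{R}^K$ by

\[
\Phi \tilde{r} = \Phi r^* - \ell(r^*, \theta) \mathbf{1},
\]

and set $\tilde{s} = s(r^*, \theta)$, the $s$-component of the solution to the LP (9) with parameters $r^*$ and $\theta$. We will
demonstrate that $(\tilde{r}, \tilde{s})$ is feasible for (8). Observe that, by the definition of the LP (9),

\[
\Phi r^* \le T \Phi r^* + \tilde{s} + (1 - \alpha) \ell(r^*, \theta) \mathbf{1}.
\]

Then,

\[
\begin{aligned}
T \Phi \tilde{r} &= T \Phi r^* - \alpha \ell(r^*, \theta) \mathbf{1} \\
&\ge \Phi r^* - \tilde{s} - (1 - \alpha) \ell(r^*, \theta) \mathbf{1} - \alpha \ell(r^*, \theta) \mathbf{1} \\
&= \Phi \tilde{r} - \tilde{s}.
\end{aligned}
\]

Now, let $(r_{\text{SALP}}, \bar{s})$ be the solution to the SALP (8). By Lemma 2,

\[
\begin{aligned}
\|J^* - \Phi r_{\text{SALP}}\|_{1,\nu}
&\le \|J^* - \Phi r_{\text{SALP}} + \Delta^* \bar{s}\|_{1,\nu} + \|\Delta^* \bar{s}\|_{1,\nu} \\
&= \nu^\top (J^* - \Phi r_{\text{SALP}} + \Delta^* \bar{s}) + \nu^\top \Delta^* \bar{s} \\
&= \nu^\top (J^* - \Phi r_{\text{SALP}}) + \frac{2 \pi_{\mu^*,\nu}^\top \bar{s}}{1 - \alpha} \\
&\le \nu^\top (J^* - \Phi r_{\text{SALP}}) + \frac{2\theta}{1 - \alpha} \\
&\le \nu^\top (J^* - \Phi \tilde{r}) + \frac{2\theta}{1 - \alpha} \\
&\le \|J^* - \Phi \tilde{r}\|_\infty + \frac{2\theta}{1 - \alpha} \\
&\le \|J^* - \Phi r^*\|_\infty + \|\Phi r^* - \Phi \tilde{r}\|_\infty + \frac{2\theta}{1 - \alpha} \\
&= \|J^* - \Phi r^*\|_\infty + \ell(r^*, \theta) + \frac{2\theta}{1 - \alpha},
\end{aligned}
\]

as desired. ■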

While Theorem 1 reinforces the intuition (shown via Figure 1) that the SALP will permit closer
approximations to $J^*$ than the ALP, the bound leaves room for improvement:

1. The right hand side of our bound measures projection error, $\|J^* - \Phi r^*\|_\infty$, in the $\|\cdot\|_\infty$ norm. Since
it is unlikely that the basis functions $\Phi$ will provide a uniformly good approximation over the entire
state space, the right hand side of our bound could be quite large.

2. As suggested by (4), the choice of state relevance weights can significantly influence the solution. In
particular, it allows us to choose regions of the state space where we would like a better approximation
of $J^*$. The right hand side of our bound, however, is independent of $\nu$.

3. Our guarantee does not suggest a concrete choice of the violation budget parameter $\theta$.

The next section will present a substantially refined approximation bound that will address these issues.

4.3. A Stronger Approximation Guarantee 

With the intent of deriving stronger approximation guarantees, we begin this section by introducing a 'nicer'
measure of the quality of approximation afforded by $\Phi$. In particular, instead of measuring the approximation
error $J^* - \Phi r^*$ in the $\|\cdot\|_\infty$ norm as we did for our previous bounds, we will use a weighted max norm
defined according to:

\[
\|J\|_{\infty,1/\psi} \triangleq \max_{x \in \mathcal{X}} \frac{|J(x)|}{\psi(x)}.
\]

Here, $\psi : \mathcal{X} \to [1, \infty)$ is a given 'weighting' function. The weighting function $\psi$ allows us to weight
approximation error in a non-uniform fashion across the state space and in this manner potentially ignore
approximation quality in regions of the state space that are less relevant. We define $\Psi$ to be the set of all
weighting functions, i.e.,

\[
\Psi \triangleq \left\{ \psi \in \mathbb{R}^{\mathcal{X}} : \psi \ge \mathbf{1} \right\}.
\]

Given a particular $\psi \in \Psi$, we define a scalar

\[
\beta(\psi) \triangleq \max_{x,a} \frac{\sum_{x'} P_a(x, x') \psi(x')}{\psi(x)}.
\]

One may view $\beta(\psi)$ as a measure of the 'stability' of the system.
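
Both quantities just defined are straightforward to compute when the state space is small. The sketch below evaluates the weighted max norm and $\beta(\psi)$ for a hypothetical two-state, two-action transition kernel `P` and two illustrative weighting functions; all data are assumptions made for the example.

```python
def weighted_max_norm(J, psi):
    """||J||_{infinity, 1/psi} = max_x |J(x)| / psi(x)."""
    return max(abs(j) / w for j, w in zip(J, psi))


def beta(psi, P):
    """beta(psi) = max over (x, a) of sum_y P_a(x, y) psi(y) / psi(x)."""
    n = len(psi)
    return max(
        sum(P_a[x][y] * psi[y] for y in range(n)) / psi[x]
        for P_a in P
        for x in range(n)
    )


# Hypothetical transition kernel: P[a][x][y].
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.6, 0.4]]]

psi_uniform = [1.0, 1.0]  # recovers the ordinary max norm
psi_skewed = [1.0, 2.0]   # tolerates twice the error in the second state

norm_uniform = weighted_max_norm([3.0, -4.0], psi_uniform)
norm_skewed = weighted_max_norm([3.0, -4.0], psi_skewed)
beta_uniform = beta(psi_uniform, P)
beta_skewed = beta(psi_skewed, P)
print(norm_uniform, norm_skewed, beta_uniform, beta_skewed)
```

With the uniform weighting, every row of every transition matrix sums to one, so $\beta(\psi) = 1$; a non-uniform $\psi$ trades a smaller weighted error against a possibly larger $\beta(\psi)$.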

In addition to specifying the sampling distribution $\pi$, as we did in Section 4.2, we will specify (implicitly)
a particular choice of the violation budget $\theta$. In particular, we will consider solving the following SALP:

\[
\begin{array}{ll}
\underset{r,s}{\text{maximize}} & \nu^\top \Phi r - \dfrac{2 \pi_{\mu^*,\nu}^\top s}{1 - \alpha} \\
\text{subject to} & \Phi r \le T \Phi r + s, \quad s \ge 0.
\end{array}
\tag{12}
\]

It is clear that (12) is equivalent to (8) for a specific choice of $\theta$. We then have:

Theorem 2. If $r_{\text{SALP}}$ is an optimal solution to (12), then

\[
\|J^* - \Phi r_{\text{SALP}}\|_{1,\nu} \le \inf_{r, \psi \in \Psi} \|J^* - \Phi r\|_{\infty,1/\psi} \left( \nu^\top \psi + \frac{2 (\pi_{\mu^*,\nu}^\top \psi + 1)(\alpha \beta(\psi) + 1)}{1 - \alpha} \right).
\]

Before presenting a proof for this approximation guarantee, it is worth placing the result in context to
understand its implications. For this, we recall a closely related result shown by de Farias and Van Roy
(2003) for the ALP. In particular, de Farias and Van Roy (2003) showed that given an appropriate weighting
function (in their context, a 'Lyapunov' function) $\psi$, one may solve an ALP, with $\psi$ in the span of the basis
functions $\Phi$. The solution $r_{\text{ALP}}$ to such an ALP then satisfies:

\[
\|J^* - \Phi r_{\text{ALP}}\|_{1,\nu} \le \inf_{r} \|J^* - \Phi r\|_{\infty,1/\psi} \, \frac{2 \nu^\top \psi}{1 - \alpha \beta(\psi)},
\tag{13}
\]

provided that $\beta(\psi) < 1/\alpha$. Selecting an appropriate $\psi$ in their context is viewed to be an important task for
practical performance and often requires a good deal of problem specific analysis; de Farias and Van Roy
(2003) identify appropriate $\psi$ for several queueing models. Note that this is equivalent to identifying a
desirable basis function. In contrast, the guarantee we present optimizes over all possible $\psi$ (including those
$\psi$ that do not satisfy the Lyapunov condition $\beta(\psi) < 1/\alpha$, and that are not necessarily in the span of $\Phi$).

To make the comparison more precise, let us focus attention on a particular choice of $\nu$, namely $\nu =
\pi_{\mu^*} \triangleq \pi_*$, the stationary distribution induced under an optimal policy $\mu^*$. In this case, restricting attention
to the set of weighting functions

\[
\bar{\Psi} = \{ \psi \in \Psi : \alpha \beta(\psi) < 1 \},
\]

so as to make the two bounds comparable, Theorem 2 guarantees that

\[
\begin{aligned}
\|J^* - \Phi r_{\text{SALP}}\|_{1,\nu}
&\le \inf_{r, \psi \in \bar{\Psi}} \|J^* - \Phi r\|_{\infty,1/\psi} \left( \pi_*^\top \psi + \frac{2 (\pi_*^\top \psi + 1)(\alpha \beta(\psi) + 1)}{1 - \alpha} \right) \\
&\le \inf_{r, \psi \in \bar{\Psi}} \|J^* - \Phi r\|_{\infty,1/\psi} \, \frac{9 \pi_*^\top \psi}{1 - \alpha}.
\end{aligned}
\]

On the other hand, observing that $\beta(\psi) \ge 1$ for all $\psi \in \Psi$, the right hand side for the ALP bound (13) is at
least

\[
\frac{2 \pi_*^\top \psi}{1 - \alpha} \inf_{r} \|J^* - \Phi r\|_{\infty,1/\psi}.
\]

Thus, the approximation guarantee of Theorem 2 allows us to view the SALP as automating the critical
procedure of identifying a good Lyapunov function for a given problem.

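Proof of Theorem 2. Let $r \in \mathbb{R}^K$ and $\psi \in \Psi$ be arbitrary. Define the vectors $\epsilon, s \in \mathbb{R}^{\mathcal X}$ component-wise by

$$\epsilon(x) \triangleq ((\Phi r)(x) - (T\Phi r)(x))^+, \qquad s(x) \triangleq \epsilon(x)\left(1 - \frac{1}{\psi(x)}\right).$$

Notice that $0 \le s \le \epsilon$.

We next make a few observations. First, define $\tilde r$ according to

$$\Phi\tilde r = \Phi r - \frac{\|\epsilon\|_{\infty,1/\psi}}{1-\alpha}\,\mathbf{1}.$$

Observe that, by a similar construction to that in Theorem 1, $(\tilde r, s)$ is feasible for (12). Also,

$$\|\Phi r - \Phi\tilde r\|_\infty = \frac{\|\epsilon\|_{\infty,1/\psi}}{1-\alpha} \le \frac{\|T\Phi r - \Phi r\|_{\infty,1/\psi}}{1-\alpha}.$$

Furthermore,

$$\pi_{\mu^*,\nu}^\top s \le (\pi_{\mu^*,\nu}^\top\psi)\,\|\epsilon\|_{\infty,1/\psi} \le (\pi_{\mu^*,\nu}^\top\psi)\,\|T\Phi r - \Phi r\|_{\infty,1/\psi}.$$

Finally, note that

$$\nu^\top(J^* - \Phi r) \le (\nu^\top\psi)\,\|J^* - \Phi r\|_{\infty,1/\psi}.$$

Now, suppose that $(r_{SALP}, \bar s)$ is an optimal solution to the SALP (12). We have from the last set of inequalities in the proof of Theorem 1 and the above observations,

$$\begin{aligned}
\|J^* - \Phi r_{SALP}\|_{1,\nu} &\le \nu^\top(J^* - \Phi r_{SALP}) + \frac{2\pi_{\mu^*,\nu}^\top \bar s}{1-\alpha} \\
&\le \nu^\top(J^* - \Phi\tilde r) + \frac{2\pi_{\mu^*,\nu}^\top s}{1-\alpha} \\
&\le \nu^\top(J^* - \Phi r) + \nu^\top(\Phi r - \Phi\tilde r) + \frac{2\pi_{\mu^*,\nu}^\top s}{1-\alpha} \\
&\le \nu^\top(J^* - \Phi r) + \|\Phi r - \Phi\tilde r\|_\infty + \frac{2\pi_{\mu^*,\nu}^\top s}{1-\alpha} \\
&\le (\nu^\top\psi)\,\|J^* - \Phi r\|_{\infty,1/\psi} + \frac{\|T\Phi r - \Phi r\|_{\infty,1/\psi}}{1-\alpha}\left(1 + 2\pi_{\mu^*,\nu}^\top\psi\right).
\end{aligned} \tag{14}$$

Since our choice of $r$ and $\psi$ was arbitrary, we have:

$$\|J^* - \Phi r_{SALP}\|_{1,\nu} \le \inf_{r,\,\psi\in\Psi}\ (\nu^\top\psi)\,\|J^* - \Phi r\|_{\infty,1/\psi} + \frac{\|T\Phi r - \Phi r\|_{\infty,1/\psi}}{1-\alpha}\left(1 + 2\pi_{\mu^*,\nu}^\top\psi\right). \tag{15}$$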

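We would like to relate the Bellman error term $T\Phi r - \Phi r$ on the right-hand side of (15) to the approximation error $J^* - \Phi r$. In order to do so, note that for any vectors $J, \bar J \in \mathbb{R}^{\mathcal X}$,

$$|TJ(x) - T\bar J(x)| \le \alpha \max_{a} \sum_{x'\in\mathcal X} P_a(x,x')\,|J(x') - \bar J(x')|.$$

Therefore,

$$\begin{aligned}
\|T\Phi r - J^*\|_{\infty,1/\psi} &\le \alpha \max_{x,a} \frac{\sum_{x'\in\mathcal X} P_a(x,x')\,|\Phi r(x') - J^*(x')|}{\psi(x)} \\
&\le \alpha \max_{x,a} \frac{\sum_{x'\in\mathcal X} P_a(x,x')\,\psi(x')\,\dfrac{|\Phi r(x') - J^*(x')|}{\psi(x')}}{\psi(x)} \\
&\le \alpha\beta(\psi)\,\|J^* - \Phi r\|_{\infty,1/\psi}.
\end{aligned}$$

Thus,

$$\|T\Phi r - \Phi r\|_{\infty,1/\psi} \le \|T\Phi r - J^*\|_{\infty,1/\psi} + \|J^* - \Phi r\|_{\infty,1/\psi} \le \|J^* - \Phi r\|_{\infty,1/\psi}\,(1 + \alpha\beta(\psi)). \tag{16}$$

Combining (15) and (16), we get the desired result. ■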
The analytical results provided in Sections 4.2 and 4.3 provide bounds on the quality of the approximation provided by the SALP solution to $J^*$. The next section presents performance bounds with the intent of understanding the increase in expected cost incurred in using a control policy that is greedy with respect to the SALP approximation in lieu of the optimal policy.



4.4. A Performance Bound 

We will momentarily present a result that will allow us to interpret the objective of the SALP (12) as an upper bound on the performance loss of a greedy policy with respect to the SALP solution. Before doing so, we briefly introduce some relevant notation. For a given policy $\mu$, we denote

$$\Delta_\mu \triangleq \sum_{k=0}^{\infty} (\alpha P_\mu)^k = (I - \alpha P_\mu)^{-1}.$$

Thus, $\Delta^* = \Delta_{\mu^*}$. Given a vector $J \in \mathbb{R}^{\mathcal X}$, let $\mu_J$ denote the greedy policy with respect to $J$. That is, $\mu_J$ satisfies $T_{\mu_J} J = TJ$. Recall that the policy of interest to us will be $\mu_{\Phi r_{SALP}}$ for a solution $r_{SALP}$ to the SALP. Finally, for an arbitrary starting distribution over states $\eta$, we define the 'discounted' steady state distribution over states induced by $\mu_J$ according to

$$\nu(\eta, J)^\top \triangleq (1-\alpha)\,\eta^\top \sum_{k=0}^{\infty} \alpha^k P_{\mu_J}^k = (1-\alpha)\,\eta^\top \Delta_{\mu_J}.$$

We have the following upper bound on the increase in cost incurred by using $\mu_J$ in place of $\mu^*$:

Theorem 3.

$$\|J_{\mu_J} - J^*\|_{1,\eta} \le \frac{1}{1-\alpha}\left(\nu(\eta,J)^\top (J^* - J) + \frac{2}{1-\alpha}\,\pi_{\mu^*,\nu(\eta,J)}^\top (J - TJ)^+\right).$$

Theorem 3 indicates that if $J$ is close to $J^*$, so that $(J - TJ)^+$ is also small, then the expected cost incurred in using a control policy that is greedy with respect to $J$ will be close to optimal. The bound indicates the impact of approximation errors over differing parts of the state space on performance loss.
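To make the notation of this section concrete, the sketch below computes a greedy policy $\mu_J$ and the 'discounted' steady state distribution $\nu(\eta, J)$ for a small example. It is only an illustration under stated assumptions: the three-state, two-action MDP (`P`, `g`), the vector `J`, and the starting distribution `eta` are invented, and the infinite series is approximated by truncation rather than by inverting $I - \alpha P_{\mu_J}$.

```python
def bellman_backup(g, P, alpha, J):
    """Return (TJ, greedy policy mu_J) for stage costs g[a][x] and transitions P[a][x][y]."""
    n = len(J)
    TJ, policy = [], []
    for x in range(n):
        q = [g[a][x] + alpha * sum(P[a][x][y] * J[y] for y in range(n))
             for a in range(len(P))]
        best = min(range(len(q)), key=q.__getitem__)  # costs are minimized
        TJ.append(q[best])
        policy.append(best)
    return TJ, policy


def discounted_distribution(eta, P, policy, alpha, terms=2000):
    """Approximate nu(eta, J)^T = (1 - alpha) * sum_k alpha^k eta^T P_mu^k by a truncated series."""
    n = len(eta)
    row = list(eta)        # eta^T P_mu^k, starting with k = 0
    nu = [0.0] * n
    weight = 1.0 - alpha   # (1 - alpha) * alpha^k
    for _ in range(terms):
        nu = [v + weight * r for v, r in zip(nu, row)]
        row = [sum(row[x] * P[policy[x]][x][y] for x in range(n)) for y in range(n)]
        weight *= alpha
    return nu


# Hypothetical data, for illustration only.
P = [
    [[0.9, 0.1, 0.0], [0.5, 0.5, 0.0], [0.0, 0.5, 0.5]],  # action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.1, 0.9]],  # action 1
]
g = [[1.0, 2.0, 3.0], [2.0, 1.0, 0.5]]  # g[a][x]: cost of action a in state x
alpha = 0.9
J = [0.0, 0.0, 0.0]
eta = [1.0, 0.0, 0.0]

TJ, mu_J = bellman_backup(g, P, alpha, J)
nu = discounted_distribution(eta, P, mu_J, alpha)
print(mu_J, TJ)
print(nu, sum(nu))  # nu is a probability distribution up to truncation error
```

The factor $(1-\alpha)$ is what normalizes $\nu(\eta, J)$ to a probability distribution, which is why it can play the role of the state relevance weights $\nu$ in the discussion that follows.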

Suppose that $(r_{SALP}, \bar s)$ is an optimal solution to the SALP (12). Then, examining the proof of Theorem 2 and, in particular, (14), reveals that

$$\nu^\top (J^* - \Phi r_{SALP}) + \frac{2}{1-\alpha}\,\pi_{\mu^*,\nu}^\top \bar s \le \inf_{r,\,\psi\in\Psi} \|J^* - \Phi r\|_{\infty,1/\psi}\left(\nu^\top\psi + \frac{2(\pi_{\mu^*,\nu}^\top\psi + 1)(\alpha\beta(\psi) + 1)}{1-\alpha}\right). \tag{17}$$

Assume that the state relevance weights $\nu$ in the SALP (12) satisfy

$$\nu = \nu(\eta, \Phi r_{SALP}). \tag{18}$$

Then, combining Theorem 3 and (17) yields

$$\|J_{\mu_{\Phi r_{SALP}}} - J^*\|_{1,\eta} \le \frac{1}{1-\alpha}\,\inf_{r,\,\psi\in\Psi} \|J^* - \Phi r\|_{\infty,1/\psi}\left(\nu^\top\psi + \frac{2(\pi_{\mu^*,\nu}^\top\psi + 1)(\alpha\beta(\psi) + 1)}{1-\alpha}\right). \tag{19}$$

This bound directly relates the performance loss of the SALP policy to the ability of the basis function architecture to approximate $J^*$. Moreover, this relationship allows us to loosely interpret the SALP as minimizing an upper bound on performance loss.

Unfortunately, it is not clear how to make an a priori choice of the state relevance weights $\nu$ to satisfy (18), since the choice of $\nu$ determines the solution to the SALP $r_{SALP}$; this is essentially the situation one faces in performance analyses for approximate dynamic programming algorithms such as approximate value iteration and temporal difference learning (de Farias and Van Roy, 2000). Indeed, it is not clear that there exists a $\nu$ that solves the fixed point equation (18). On the other hand, given a choice of $\nu$ so that $\nu \approx \nu(\eta, \Phi r_{SALP})$, in the sense of a small Radon-Nikodym derivative between the two distributions, an approximate version of the performance bound (19) will hold. As suggested by de Farias and Van Roy (2003) in the ALP case, one possibility for finding such a choice of state relevance weights is to iteratively re-solve the SALP, each time using the policy from the prior iteration to generate state relevance weights for the next iteration.

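Proof of Theorem 3. Define $s = (J - TJ)^+$. From Lemma 2 we know that

$$J \le J^* + \Delta^* s.$$

Applying $T_{\mu^*}$ to both sides,

$$T_{\mu^*} J \le J^* + \alpha P_{\mu^*}\Delta^* s = J^* + \Delta^* s - s \le J^* + \Delta^* s,$$

so that

$$TJ \le T_{\mu^*} J \le J^* + \Delta^* s. \tag{20}$$

Then,

$$\begin{aligned}
\eta^\top (J_{\mu_J} - J) &= \eta^\top \sum_{k=0}^{\infty} \alpha^k P_{\mu_J}^k \left(g_{\mu_J} + \alpha P_{\mu_J} J - J\right) \\
&= \eta^\top \Delta_{\mu_J} (TJ - J) \\
&\le \eta^\top \Delta_{\mu_J} (J^* - J + \Delta^* s) \\
&= \frac{1}{1-\alpha}\,\nu(\eta,J)^\top (J^* - J + \Delta^* s),
\end{aligned} \tag{21}$$

where the second equality is from the fact that $g_{\mu_J} + \alpha P_{\mu_J} J = T_{\mu_J} J = TJ$, and the inequality follows from (20).

Further,

$$\begin{aligned}
\eta^\top (J - J^*) &\le \eta^\top \Delta^* s \\
&\le \eta^\top \Delta_{\mu_J} \Delta^* s \\
&= \frac{1}{1-\alpha}\,\nu(\eta,J)^\top \Delta^* s,
\end{aligned} \tag{22}$$

where the second inequality follows from the fact that $\Delta^* s \ge 0$ and $\Delta_{\mu_J} = I + \sum_{k=1}^{\infty} \alpha^k P_{\mu_J}^k$.

It follows from (21) and (22) that

$$\begin{aligned}
\eta^\top (J_{\mu_J} - J^*) &= \eta^\top (J_{\mu_J} - J) + \eta^\top (J - J^*) \\
&\le \frac{1}{1-\alpha}\,\nu(\eta,J)^\top (J^* - J + 2\Delta^* s) \\
&= \frac{1}{1-\alpha}\left(\nu(\eta,J)^\top (J^* - J) + \frac{2}{1-\alpha}\,\pi_{\mu^*,\nu(\eta,J)}^\top s\right),
\end{aligned}$$

which is the result. ■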
4.5. Sample Complexity 

Our analysis thus far has assumed we have the ability to solve the SALP. The number of constraints and variables in the SALP grows linearly with the size of the state space $\mathcal X$. Hence, this program will typically be intractable for problems of interest. One solution, which we describe here, is to consider a sampled variation of the SALP, where states and constraints are sampled rather than exhaustively considered. In this section, we will argue that the solution to the SALP is well approximated by the solution to a tractable, sampled variation.

In particular, let $\hat{\mathcal X}$ be a collection of $S$ states drawn independently from the state space $\mathcal X$ according to the distribution $\pi_{\mu^*,\nu}$. Consider the following optimization program:

$$\begin{array}{ll}
\underset{r,\,s}{\text{maximize}} & \nu^\top \Phi r - \dfrac{2}{(1-\alpha)S} \displaystyle\sum_{x\in\hat{\mathcal X}} s(x) \\
\text{subject to} & \Phi r(x) \le T\Phi r(x) + s(x), \quad \forall\, x \in \hat{\mathcal X}, \\
& s \ge 0, \quad r \in \mathcal N.
\end{array} \tag{23}$$

Here, $\mathcal N \subset \mathbb{R}^K$ is a bounding set that restricts the magnitude of the sampled SALP solution; we will discuss the role of $\mathcal N$ shortly. Notice that (23) is a variation of (12), where only the decision variables and constraints corresponding to the sampled subset of states are retained. The resulting optimization program has $K + S$ decision variables and $S|\mathcal A|$ linear constraints. For a moderate number of samples $S$, this is easily solved. Even in scenarios where the size of the action space $\mathcal A$ is large, it is frequently possible to rewrite (23) as a compact linear program (Farias and Van Roy, 2007; Moallemi et al., 2008). The natural question, however, is whether the solution to the sampled SALP (23) is a good approximation to the solution provided by the SALP (12), for a 'tractable' number of samples $S$.

Here, we answer this question in the affirmative. We will provide a sample complexity bound that indicates that for a number of samples $S$ that scales linearly with the dimension of $\Phi$, $K$, and that need not depend on the size of the state space, the solution to the sampled SALP satisfies, with high probability, the approximation guarantee presented for the SALP solution in Theorem 2.

Our proof will rely on the following lemma, which provides a Chernoff bound for the uniform convergence of a certain class of functions. The proof of this lemma, which is based on bounding the pseudo-dimension of the class of functions, can be found in Appendix B.



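Lemma 3. Given a constant $B > 0$, define the function $\zeta : \mathbb{R} \to [0, B]$ by

$$\zeta(t) \triangleq \max(\min(t, B), 0).$$

Consider a pair of random variables $(Y, Z) \in \mathbb{R}^K \times \mathbb{R}$. For each $i = 1, \ldots, n$, let the pair $(Y^{(i)}, Z^{(i)})$ be an i.i.d. sample drawn according to the distribution of $(Y, Z)$. Then, for all $\epsilon \in (0, B]$,

$$P\left(\sup_{r\in\mathbb{R}^K}\left|\frac{1}{n}\sum_{i=1}^{n} \zeta\left(r^\top Y^{(i)} + Z^{(i)}\right) - E\left[\zeta\left(r^\top Y + Z\right)\right]\right| > \epsilon\right) \le 8\left(\frac{32eB}{\epsilon}\log\frac{32eB}{\epsilon}\right)^{K+2}\exp\left(-\frac{\epsilon^2 n}{64B^2}\right).$$

Moreover, given $\delta \in (0, 1)$, if

$$n \ge \frac{64B^2}{\epsilon^2}\left(2(K+2)\log\frac{16eB}{\epsilon} + \log\frac{8}{\delta}\right),$$

then this probability is at most $\delta$.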



In order to establish a sample complexity result, we require control over the magnitude of optimal solutions to the SALP (12). This control is provided by the bounding set $\mathcal N$. In particular, we will assume that $\mathcal N$ is large enough so that it contains an optimal solution to the SALP (12), and we define the constant

$$B \triangleq \sup_{r\in\mathcal N} \|(\Phi r - T\Phi r)^+\|_\infty. \tag{24}$$

This quantity is closely related to the diameter of the region $\mathcal N$. Our main sample complexity result can then be stated as follows:

Theorem 4. Under the conditions of Theorem 2, let $r_{SALP}$ be an optimal solution to the SALP (12), and let $\hat r_{SALP}$ be an optimal solution to the sampled SALP (23). Assume that $r_{SALP} \in \mathcal N$. Further, given $\epsilon \in (0, B]$ and $\delta \in (0, 1/2]$, suppose that the number of sampled states $S$ satisfies

$$S \ge \frac{64B^2}{\epsilon^2}\left(2(K+2)\log\frac{16eB}{\epsilon} + \log\frac{8}{\delta}\right).$$

Then, with probability at least $1 - \delta - 2^{-383}\delta^{128}$,

$$\|J^* - \Phi \hat r_{SALP}\|_{1,\nu} \le \inf_{r\in\mathcal N,\,\psi\in\Psi} \|J^* - \Phi r\|_{\infty,1/\psi}\left(\nu^\top\psi + \frac{2(\pi_{\mu^*,\nu}^\top\psi + 1)(\alpha\beta(\psi) + 1)}{1-\alpha}\right) + \frac{4\epsilon}{1-\alpha}.$$



Theorem 4 establishes that the sampled SALP (23) provides a close approximation to the solution of the SALP (12), in the sense that the approximation guarantee we established for the SALP in Theorem 2 is approximately valid for the solution to the sampled SALP, with high probability. The theorem precisely specifies the number of samples required to accomplish this task. This number depends linearly on the number of basis functions and the diameter of the feasible region, but is otherwise independent of the size of the state space for the MDP under consideration.
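To get a feel for the magnitudes involved, the sample size condition of Theorem 4 can be evaluated numerically. The sketch below is illustrative only: it assumes the condition reads $S \ge \frac{64B^2}{\epsilon^2}\left(2(K+2)\log\frac{16eB}{\epsilon} + \log\frac{8}{\delta}\right)$ with natural logarithms, the function name `salp_sample_size` is hypothetical, and the parameter values in the usage lines are arbitrary.

```python
import math


def salp_sample_size(K, B, eps, delta):
    """Smallest integer S meeting the (assumed) sample size condition of Theorem 4."""
    if not 0 < eps <= B:
        raise ValueError("eps must lie in (0, B]")
    if not 0 < delta <= 0.5:
        raise ValueError("delta must lie in (0, 1/2]")
    bound = (64.0 * B ** 2 / eps ** 2) * (
        2 * (K + 2) * math.log(16 * math.e * B / eps) + math.log(8 / delta)
    )
    return math.ceil(bound)


# Arbitrary illustrative values: K = 22 basis functions, B = 1.
print(salp_sample_size(K=22, B=1.0, eps=0.1, delta=0.05))
print(salp_sample_size(K=44, B=1.0, eps=0.1, delta=0.05))   # roughly doubles with K
print(salp_sample_size(K=22, B=1.0, eps=0.05, delta=0.05))  # grows at least 4x when eps is halved
```

The state space size does not appear among the arguments, which is the point of the theorem; the bound is driven by $K$, the ratio $B/\epsilon$, and (weakly) by $\delta$.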

It is worth juxtaposing our sample complexity result with that available for the ALP. Recall that the ALP has a large number of constraints but a small number of variables; the SALP is thus, at least superficially, a significantly more complex program. Exploiting the fact that the ALP has a small number of variables, de Farias and Van Roy (2004) establish a sample complexity bound for a sampled version of the ALP analogous to the sampled SALP (23). The number of samples required for this sampled ALP to produce a good approximation to the ALP can be shown to depend on the same problem parameters we have identified here, viz., the constant $B$ and the number of basis functions $K$. The sample complexity in the ALP case is identical to the sample complexity bound established here, up to constants and a linear dependence on the ratio $B/\epsilon$. This is as opposed to the quadratic dependence on $B/\epsilon$ of the sampled SALP. Although the two sample complexity bounds are within polynomial terms of each other, one may rightfully worry about the practical implications of an additional factor of $B/\epsilon$ in the required number of samples. In the computational study of Section 6, we will attempt to address this concern.

Finally, note that the sampled SALP has $K + S$ variables and $S|\mathcal A|$ linear constraints whereas the sampled ALP has merely $K$ variables and $S|\mathcal A|$ linear constraints. Nonetheless, we will show in Section 5.1 that the special structure of the Hessian associated with the sampled SALP affords us a linear computational complexity dependence on $S$.

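Proof of Theorem 4. Define the vectors

$$\hat s_{\mu^*} \triangleq (\Phi \hat r_{SALP} - T_{\mu^*}\Phi \hat r_{SALP})^+, \quad \text{and} \quad \hat s \triangleq (\Phi \hat r_{SALP} - T\Phi \hat r_{SALP})^+.$$

One has, via Lemma 2, that

$$\Phi \hat r_{SALP} - J^* \le \Delta^* \hat s_{\mu^*}.$$

Thus, as in the last set of inequalities in the proof of Theorem 1, we have

$$\|J^* - \Phi \hat r_{SALP}\|_{1,\nu} \le \nu^\top (J^* - \Phi \hat r_{SALP}) + \frac{2\pi_{\mu^*,\nu}^\top \hat s_{\mu^*}}{1-\alpha}. \tag{25}$$

Now, let $\hat\pi_{\mu^*,\nu}$ be the empirical measure induced by the collection of sampled states $\hat{\mathcal X}$. Given a state $x \in \mathcal X$, define a vector $Y(x) \in \mathbb{R}^K$ and a scalar $Z(x) \in \mathbb{R}$ according to

$$Y(x) \triangleq \Phi(x)^\top - \alpha P_{\mu^*}\Phi(x)^\top, \qquad Z(x) \triangleq -g(x, \mu^*(x)),$$

so that, for any vector of weights $r \in \mathcal N$,

$$(\Phi r(x) - T_{\mu^*}\Phi r(x))^+ = \zeta\left(r^\top Y(x) + Z(x)\right).$$

Then,

$$\left|\hat\pi_{\mu^*,\nu}^\top \hat s_{\mu^*} - \pi_{\mu^*,\nu}^\top \hat s_{\mu^*}\right| \le \sup_{r\in\mathcal N}\left|\frac{1}{S}\sum_{x\in\hat{\mathcal X}} \zeta\left(r^\top Y(x) + Z(x)\right) - \sum_{x\in\mathcal X} \pi_{\mu^*,\nu}(x)\,\zeta\left(r^\top Y(x) + Z(x)\right)\right|.$$

Applying Lemma 3, we have that

$$P\left(\left|\hat\pi_{\mu^*,\nu}^\top \hat s_{\mu^*} - \pi_{\mu^*,\nu}^\top \hat s_{\mu^*}\right| > \epsilon\right) \le \delta. \tag{26}$$

Next, suppose $(r_{SALP}, \bar s)$ is an optimal solution to the SALP (12). Then, with probability at least $1 - \delta$,

$$\begin{aligned}
\nu^\top (J^* - \Phi \hat r_{SALP}) + \frac{2\pi_{\mu^*,\nu}^\top \hat s_{\mu^*}}{1-\alpha}
&\le \nu^\top (J^* - \Phi \hat r_{SALP}) + \frac{2\hat\pi_{\mu^*,\nu}^\top \hat s_{\mu^*}}{1-\alpha} + \frac{2\epsilon}{1-\alpha} \\
&\le \nu^\top (J^* - \Phi \hat r_{SALP}) + \frac{2\hat\pi_{\mu^*,\nu}^\top \hat s}{1-\alpha} + \frac{2\epsilon}{1-\alpha} \\
&\le \nu^\top (J^* - \Phi r_{SALP}) + \frac{2\hat\pi_{\mu^*,\nu}^\top \bar s}{1-\alpha} + \frac{2\epsilon}{1-\alpha},
\end{aligned} \tag{27}$$

where the first inequality follows from (26), the second from the fact that $\hat s_{\mu^*} \le \hat s$, and the final inequality follows from the optimality of $(\hat r_{SALP}, \hat s)$ for the sampled SALP (23).

Notice that, without loss of generality, we can assume that $\bar s(x) = (\Phi r_{SALP}(x) - T\Phi r_{SALP}(x))^+$, for each $x \in \mathcal X$. Thus, $0 \le \bar s(x) \le B$. Applying Hoeffding's inequality,

$$P\left(\left|\hat\pi_{\mu^*,\nu}^\top \bar s - \pi_{\mu^*,\nu}^\top \bar s\right| \ge \epsilon\right) \le 2\exp\left(-\frac{2S\epsilon^2}{B^2}\right) < 2^{-383}\delta^{128},$$

where the final inequality follows from our choice of $S$. Combining this with (25) and (27), with probability at least $1 - \delta - 2^{-383}\delta^{128}$, we have

$$\begin{aligned}
\|J^* - \Phi \hat r_{SALP}\|_{1,\nu} &\le \nu^\top (J^* - \Phi r_{SALP}) + \frac{2\hat\pi_{\mu^*,\nu}^\top \bar s}{1-\alpha} + \frac{2\epsilon}{1-\alpha} \\
&\le \nu^\top (J^* - \Phi r_{SALP}) + \frac{2\pi_{\mu^*,\nu}^\top \bar s}{1-\alpha} + \frac{4\epsilon}{1-\alpha}.
\end{aligned}$$

The result then follows from (14)-(16) in the proof of Theorem 2. ■

An alternative sample complexity bound of a similar flavor can be developed using results from the stochastic programming literature. The key idea is that the SALP (12) can be reformulated as the following convex stochastic programming problem:

$$\underset{r\in\mathcal N}{\text{maximize}} \quad E_{\nu,\,\pi_{\mu^*,\nu}}\left[\Phi r(x_0) - \frac{2}{1-\alpha}\left(\Phi r(x) - T\Phi r(x)\right)^+\right], \tag{28}$$

where $x_0, x \in \mathcal X$ have distributions $\nu$ and $\pi_{\mu^*,\nu}$, respectively. Interpreting the sampled SALP (23) as a sample average approximation of (28), a sample complexity bound can be developed using the methodology of Shapiro et al. (2009, Chap. 5), for example. This proof is simpler than the one presented here, but yields a cruder estimate that is not as easily compared with those available for the ALP.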



5. Practical Implementation 

The SALP, as it is written, is not directly implementable. As discussed in Section 4.5, the number of variables and constraints grows linearly with the size of the state space $\mathcal X$, making the optimization problem intractable. Moreover, it is not clear how to choose parameters such as the probability distributions $\nu$ and $\pi$ or the violation budget $\theta$. However, the analysis in Section 4 provides insight that allows us to codify a recipe for a practical and implementable variation.

Consider the following algorithm:

1. Sample $S$ states independently from the state space $\mathcal X$ according to a sampling distribution $\rho$. Denote the set of sampled states by $\hat{\mathcal X}$.

2. Perform a line search over increasing choices of $\theta \ge 0$. For each choice of $\theta$,

(a) Solve the sampled SALP:

$$\begin{array}{ll}
\underset{r,\,s}{\text{maximize}} & \dfrac{1}{S} \displaystyle\sum_{x\in\hat{\mathcal X}} (\Phi r)(x) \\
\text{subject to} & \Phi r(x) \le T\Phi r(x) + s(x), \quad \forall\, x \in \hat{\mathcal X}, \\
& \dfrac{1}{S} \displaystyle\sum_{x\in\hat{\mathcal X}} s(x) \le \theta, \\
& s \ge 0.
\end{array} \tag{29}$$

(b) Evaluate the performance of the policy resulting from (29) via Monte Carlo simulation.

3. Select the best of the policies evaluated in Step 2.

This algorithm takes as inputs the following parameters:

• $\Phi$, a collection of $K$ basis functions.

• $S$, the number of states to sample. By sampling $S$ states, we limit the number of variables and constraints in the sampled SALP (29). Thus, by keeping $S$ small, the sampled SALP becomes tractable to solve numerically. On the other hand, the quality of the approximation provided by the sampled SALP may suffer if $S$ is chosen to be too small. The sample complexity theory developed in Section 4.5 suggests that $S$ can be chosen to grow linearly with $K$, the size of the basis set. In particular, a reasonable choice of $S$ need not depend on the size of the underlying state space.

In practice, we choose $S \gg K$ to be as large as possible subject to limits on the CPU time and memory required to solve (29). In Section 5.1, we will discuss how the program (29) can be solved efficiently via barrier methods for large choices of $S$.

• $\rho$, a sampling distribution on the state space $\mathcal X$. The distribution $\rho$ is used, via Monte Carlo sampling, in place of both the distributions $\nu$ and $\pi$ in the SALP. Recall that the bounds in Theorems 1 and 2 provide approximation guarantees in a $\nu$-weighted 1-norm. This suggests that $\nu$ should be chosen to emphasize regions of the state space where the quality of approximation is most important. Similarly, the theory in Section 4 suggests that the distribution $\pi$ should be related to the distribution induced by the optimal policy.

In practice, we choose $\rho$ to be the stationary distribution under a baseline policy. States are then sampled from $\rho$ via Monte Carlo simulation of the baseline policy. This baseline policy can correspond, for example, to a heuristic control policy for the system. More sophisticated procedures such as 'bootstrapping' can also be considered (Farias and Van Roy, 2006). Here, one starts with a heuristic policy to be used for sampling states. Given the sampled states, the application of our algorithm will result in a new control policy. The new control policy can then be used for state sampling in a subsequent round of optimization, and the process can be repeated.



Note that our algorithm does not require an explicit choice of the violation budget $\theta$, since we optimize with a line search over the choices of $\theta$. This is motivated by the fact that the sampled SALP (29) can be efficiently re-solved for increasing values of $\theta$ via a 'warm-start' procedure. Here, the optimal solution of the sampled SALP given the previous value of $\theta$ is used as a starting point for the solver in a subsequent round of optimization. Using this method we observe that, in practice, the total solution time for a series of sampled SALP instances that vary by their values of $\theta$ grows sub-linearly with the number of instances.
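The line search with warm starts described above can be sketched as a short driver loop. This is a schematic under stated assumptions, not the authors' implementation: `solve` and `evaluate` are hypothetical callables standing in for an LP solver for (29) and a Monte Carlo policy evaluator, and the stopping rule (stop at the first budget whose score falls below the previous one) is one possible reading of 'until performance begins to degrade'. The stand-ins used in the usage lines are toy functions, not Tetris results.

```python
def line_search_over_budgets(thetas, solve, evaluate):
    """Search increasing violation budgets, warm-starting each solve from the previous solution.

    thetas:   increasing sequence of budgets theta >= 0
    solve:    callable (theta, warm_start) -> solution     (hypothetical LP solver wrapper)
    evaluate: callable (solution) -> score, higher is better (hypothetical simulator)
    Returns the best (theta, score, solution) triple seen.
    """
    best = None
    warm_start = None
    previous_score = None
    for theta in thetas:
        solution = solve(theta, warm_start)
        score = evaluate(solution)
        if best is None or score > best[1]:
            best = (theta, score, solution)
        if previous_score is not None and score < previous_score:
            break  # performance began to degrade; stop exploring larger budgets
        previous_score, warm_start = score, solution
    return best


# Toy stand-ins: the "solution" is just theta, and the score peaks at theta = 0.16.
calls = []

def toy_solve(theta, warm_start):
    calls.append((theta, warm_start))  # record how each solve was warm-started
    return theta

def toy_evaluate(solution):
    return -(solution - 0.16) ** 2

best = line_search_over_budgets([0.0, 0.01, 0.04, 0.16, 0.64, 2.56], toy_solve, toy_evaluate)
print(best[0])     # budget with the best score
print(len(calls))  # number of LPs actually solved
```

Because each solve receives the previous solution as its starting point, a real solver that supports warm starts can reuse most of the work from the preceding budget, which is the effect the paragraph above attributes to the procedure.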



5.1. Efficient Linear Programming Solution 

The sampled SALP (29) can be written explicitly in the form of a linear program:

$$\begin{array}{ll}
\underset{r,\,s}{\text{maximize}} & c^\top r \\
\text{subject to} & \begin{bmatrix} A_{11} & A_{12} \\ 0 & d^\top \end{bmatrix} \begin{bmatrix} r \\ s \end{bmatrix} \le b, \\
& s \ge 0.
\end{array} \tag{30}$$

Here, $b \in \mathbb{R}^{S|\mathcal A|+1}$, $c \in \mathbb{R}^K$, and $d \in \mathbb{R}^S$ are vectors, $A_{11} \in \mathbb{R}^{S|\mathcal A| \times K}$ is a dense matrix, and $A_{12} \in \mathbb{R}^{S|\mathcal A| \times S}$ is a sparse matrix. This LP has $K + S$ decision variables and $S|\mathcal A| + 1$ linear constraints.

Typically, the number of sampled states $S$ will be quite large. For example, in Section 6, we will discuss an example where $K = 22$ and $S = 300{,}000$. The resulting LP has approximately 300,000 variables and 6,600,000 constraints. In such cases, with many variables and many constraints, one might expect the LP to be difficult to solve. However, the sparsity structure of the constraint matrix in (30) and, especially, that of the sub-matrix $A_{12}$, allows efficient optimization of this LP.

In particular, imagine solving the LP (30) with a barrier method. The computational bottleneck of such a method is the inner Newton step to compute a central point (see, for example, Boyd and Vandenberghe, 2004). This step involves the solution of a system of linear equations of the form

$$H \begin{bmatrix} \Delta r \\ \Delta s \end{bmatrix} = -g. \tag{31}$$

Here, $g \in \mathbb{R}^{K+S}$ is a vector and $H \in \mathbb{R}^{(K+S)\times(K+S)}$ is the Hessian matrix of the barrier function. Without exploiting the structure of the matrix $H$, this linear system can be solved with $O((K+S)^3)$ floating point operations. For large values of $S$, this may be prohibitive.

Fortunately, the Hessian matrix $H$ can be decomposed according to the block structure

$$H = \begin{bmatrix} H_{11} & H_{12} \\ H_{12}^\top & H_{22} \end{bmatrix},$$

where $H_{11} \in \mathbb{R}^{K\times K}$, $H_{12} \in \mathbb{R}^{K\times S}$, and $H_{22} \in \mathbb{R}^{S\times S}$. In the case of the LP (30), it is not difficult to see that the sparsity structure of the sub-matrix $A_{12}$ ensures that the sub-matrix $H_{22}$ takes the form of a diagonal matrix plus a rank-one matrix. This allows the linear system (31) to be solved with $O(K^2 S + K^3)$ floating point operations. This is linear in $S$, the number of sampled states.
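The reason a 'diagonal plus rank-one' block is cheap to work with can be seen in isolation. The sketch below solves $(D + uu^\top)x = b$ in time linear in the dimension using the Sherman-Morrison formula. It is an illustration only: the text states the structure of $H_{22}$ but not how the authors exploit it, so the use of Sherman-Morrison, the symmetric form $uu^\top$ of the rank-one term, and the function name are assumptions, and the full solve of (31) would additionally require block elimination with $H_{11}$ and $H_{12}$.

```python
def solve_diag_plus_rank_one(d, u, b):
    """Solve (diag(d) + u u^T) x = b in O(n) time via the Sherman-Morrison formula.

    Assumes every d[i] > 0, so that diag(d) is invertible and 1 + u^T D^{-1} u > 0.
    """
    dinv_b = [bi / di for bi, di in zip(b, d)]  # D^{-1} b
    dinv_u = [ui / di for ui, di in zip(u, d)]  # D^{-1} u
    numerator = sum(ui * v for ui, v in zip(u, dinv_b))          # u^T D^{-1} b
    denominator = 1.0 + sum(ui * v for ui, v in zip(u, dinv_u))  # 1 + u^T D^{-1} u
    scale = numerator / denominator
    return [vb - scale * vu for vb, vu in zip(dinv_b, dinv_u)]


# Small made-up instance; the residual check multiplies back without forming the matrix.
d = [2.0, 3.0, 4.0]
u = [1.0, -1.0, 0.5]
b = [1.0, 2.0, 3.0]
x = solve_diag_plus_rank_one(d, u, b)
u_dot_x = sum(ui * xi for ui, xi in zip(u, x))
residual = [di * xi + ui * u_dot_x - bi for di, xi, ui, bi in zip(d, x, u, b)]
print(x)
print(max(abs(r) for r in residual))  # should be at rounding-error level
```

With an $O(S)$ solve available for the $S \times S$ block, eliminating the $s$-variables from a system like (31) costs on the order of $K$ such solves plus dense work on $K \times K$ matrices, which is consistent with the $O(K^2 S + K^3)$ count quoted above.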



6. Case Study: Tetris 

Tetris is a popular video game designed and developed by Alexey Pazhitnov in 1985. The Tetris board, illustrated in Figure 2, consists of a two-dimensional grid of 20 rows and 10 columns. The game starts with an empty grid and pieces fall randomly one after another. Each piece consists of four blocks and the player can rotate and translate it in the plane before it touches the 'floor'. The pieces come in seven different shapes and the next piece to fall is chosen from among these with equal probability. Whenever the pieces are placed such that a line of contiguous blocks is formed, a point is earned and the line is cleared. Once the board has enough blocks that the incoming piece cannot be placed for any translation and rotation, the game terminates. Hence the goal of the player is to clear the maximum number of lines before the board gets full.

Figure 2: Example of a Tetris board configuration.

Our interest in Tetris as a case study for the SALP algorithm is motivated by several facts. First, theoretical results suggest that design of an optimal Tetris player is a difficult problem. Brzustowski (1992) and Burgiel (1997) have shown that the game of Tetris has to end with probability one, under all policies. They demonstrate a sequence of pieces which leads to a terminal state of the game for all possible actions. Demaine et al. (2003) consider the offline version of Tetris and provide computational complexity results for 'optimally' playing Tetris. They show that when the sequence of pieces is known beforehand it is NP-complete to maximize the number of cleared rows, minimize the maximum height of an occupied square, or maximize the number of pieces placed before the game ends. This suggests that the online version should be computationally difficult.

Second, Tetris represents precisely the kind of large and unstructured MDP for which it is difficult to design heuristic controllers, and hence policies designed by ADP algorithms are particularly relevant. Moreover, Tetris has been employed by a number of researchers as a testbed problem. One of the important steps in applying these techniques is the choice of basis functions. Fortunately, there is a fixed set of basis functions, to be described shortly, which have been used by researchers while applying temporal-difference learning (Bertsekas and Ioffe, 1996; Bertsekas and Tsitsiklis, 1996), policy gradient methods (Kakade, 2002), and approximate linear programming (Farias and Van Roy, 2006). Hence, application of SALP to Tetris allows us to make a clear comparison to other ADP methods.

The SALP methodology described in Section 5 was applied as follows:

• MDP formulation. We used the formulation of Tetris as a Markov decision problem of Farias and Van Roy (2006). Here, the 'state' at a particular time encodes the current board configuration and the shape of the next falling piece, while the 'action' determines the placement of the falling piece.

• Basis functions. We employed the 22 basis functions originally introduced by Bertsekas and Ioffe (1996). Each basis function takes a Tetris board configuration as its argument. The functions are as follows:

- Ten basis functions, $\phi_0, \ldots, \phi_9$, mapping the state to the height $h_k$ of each of the ten columns.

- Nine basis functions, $\phi_{10}, \ldots, \phi_{18}$, each mapping the state to the absolute difference between heights of successive columns: $|h_{k+1} - h_k|$, $k = 1, \ldots, 9$.

- One basis function, $\phi_{19}$, that maps state to the maximum column height: $\max_k h_k$.

- One basis function, $\phi_{20}$, that maps state to the number of 'holes' in the board.

- One basis function, $\phi_{21}$, that is equal to 1 in every state.

• State sampling. Given a sample size $S$, a collection $\hat{\mathcal X} \subset \mathcal X$ of $S$ states was sampled. These samples were generated in an i.i.d. fashion from the stationary distribution of a (rather poor) baseline policy (see footnote 3). For each choice of sample size $S$, ten different collections of $S$ samples were generated.

• Optimization. Given the collection $\hat{\mathcal X}$ of sampled states, an increasing sequence of choices of the violation budget $\theta \ge 0$ is considered. For each choice of $\theta$, the optimization program (29) was solved.

• Policy evaluation. Given a vector of weights $\hat r$, the performance of the corresponding policy was evaluated using Monte Carlo simulation. We calculate the average performance of policy $\mu_{\hat r}$ over a series of 3000 games. Performance is measured in terms of the average number of lines eliminated in a single game. The sequence of pieces in each of the 3000 games was fixed across the evaluation of different policies in order to allow better comparisons.
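For concreteness, the 22 basis functions listed above can be computed from a board as in the following sketch. The board encoding (a list of rows, top row first, with 1 for an occupied cell) and the definition of a 'hole' (an empty cell with at least one occupied cell above it in the same column) are assumptions made for illustration; the text does not spell out either, and the function name is hypothetical.

```python
def tetris_features(board):
    """Return [h_1..h_10, |h_2-h_1|..|h_10-h_9|, max height, holes, 1] for a 0/1 board.

    board[r][c] is 1 if the cell in row r (row 0 is the top) and column c is occupied.
    """
    rows, cols = len(board), len(board[0])
    heights = []
    holes = 0
    for c in range(cols):
        column = [board[r][c] for r in range(rows)]  # cells from top to bottom
        filled = [r for r, cell in enumerate(column) if cell]
        if filled:
            top = filled[0]               # row index of the highest occupied cell
            heights.append(rows - top)    # column height measured from the floor
            # Assumed hole definition: empty cells lying below the top of the column.
            holes += sum(1 for r in range(top + 1, rows) if not column[r])
        else:
            heights.append(0)
    diffs = [abs(heights[k + 1] - heights[k]) for k in range(cols - 1)]
    return heights + diffs + [max(heights), holes, 1]


# A 20 x 10 board with two blocks in the first column and a gap between them.
board = [[0] * 10 for _ in range(20)]
board[19][0] = 1   # bottom cell of column 0
board[17][0] = 1   # two rows up, leaving row 18 empty underneath
features = tetris_features(board)
print(len(features))   # 22 features for a 10-column board
print(features)
```

The approximation $\Phi r$ evaluated at a board is then the inner product of this feature vector with the weight vector $r$.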

For each pair $(S, \theta)$, the resulting average performance (averaged over each of the 10 policies arising from the different sets of sampled states) is shown in Figure 3. Note that the $\theta = 0$ curve in Figure 3 corresponds to the original ALP algorithm. Figure 3 provides experimental evidence for the intuition expressed in Section 3 and the analytic result of Theorem 1. Relaxing the constraints of the ALP even slightly, by allowing for a small slack budget, allows for better policy performance. As the slack budget $\theta$ is increased from 0, performance dramatically improves. At the peak value of $\theta = 0.16384$, the SALP generates policies with performance that is an order of magnitude better than ALP. Beyond this value, the performance of the SALP begins to degrade, as shown by the $\theta = 0.65536$ curve. Hence, we did not explore larger values of $\theta$.

Table 1 summarizes the performance of the best policies obtained by various ADP algorithms. Note that all of these algorithms employ the same basis function architecture. The ALP and SALP results are from our experiments, while the other results are from the literature. The best performance result of SALP is a factor of 2 better than the competitors.

3 Our baseline policy had an average performance of 113 points.

[Figure 3 plot omitted; horizontal axis: Sample Size S.]

Figure 3: Performance of the average SALP policy for different values of the number of sampled states $S$ and the violation budget $\theta$. Values for $\theta$ were chosen in an increasing fashion starting from 0, until the resulting average performance began to degrade.

Algorithm                                            Best Performance    CPU Time
ALP                                                  897                 hours
TD-Learning (Bertsekas and Ioffe, 1996)              3,183               minutes
ALP with bootstrapping (Farias and Van Roy, 2006)    4,274               hours
TD-Learning (Bertsekas and Tsitsiklis, 1996)         4,471               minutes
Policy gradient (Kakade, 2002)                       5,500               days
SALP                                                 10,775              hours

Table 1: Comparison of the performance of the best policy found with various ADP methods.

Note that significantly better policies are possible with this basis function architecture than any of the ADP algorithms in Table 1 discover. Using a heuristic global optimization method, Szita and Lőrincz (2006) report finding policies with a remarkable average performance of 350,000. Their method is very computationally intensive, however, requiring one month of CPU time. In addition, the approach employs a number of rather arbitrary Tetris-specific 'modifications' that are ultimately seen to be critical to performance; in the absence of these modifications, the method is unable to find a policy for Tetris that scores above a few hundred points. More generally, global optimization methods typically require significant trial and error and other problem
specific experimentation in order to work well.

7. Conclusion

The approximate linear programming (ALP) approach to approximate DP is interesting at the outset for two reasons: first, the ability to leverage commercial linear programming software to solve large ADP problems, and second, the ability to prove rigorous approximation guarantees and performance bounds. This paper asked whether the formulation considered in the ALP approach was the ideal formulation. In particular, we asked whether certain strong restrictions imposed on approximations produced by the approach can be relaxed in a tractable fashion and whether such a relaxation has a beneficial impact on the quality of the approximation produced. We have answered both of these questions in the affirmative. In particular, we have presented a novel linear programming formulation that, while remaining no less tractable than the ALP, appears to yield substantial performance gains and permits us to prove extremely strong approximation and performance guarantees.

There are a number of interesting algorithmic directions that warrant exploration. For instance, notice from (28) that the SALP may be written as an unconstrained stochastic optimization problem. Such problems suggest natural online update rules for the weights $r$, based on stochastic gradient methods, yielding 'data-driven' ADP methods. The menagerie of online ADP algorithms available at present are effectively iterative methods for solving a projected version of Bellman's equation. TD-learning is a good representative of this type of approach and, as can be seen from Table 1, is not among the highest performing algorithms in our computational study. An online update rule that effectively solves the SALP promises policies that will perform on par with the SALP solution, while at the same time retaining the benefits of an online ADP algorithm. A second interesting algorithmic direction worth exploring is an extension of the smoothed linear programming approach to average cost dynamic programming problems.

As discussed earlier, theoretical guarantees for ADP algorithms typically rely on some sort of idealized assumption. For instance, in the case of the ALP, it is the ability to solve an LP with a potentially intractable number of states or else access to a set of sampled states, sampled according to some idealized sampling distribution. For the SALP, it is the latter of the two assumptions. It would be interesting to see whether this assumption can be loosened for some specific class of MDPs. An interesting class of MDPs in this vein are high-dimensional optimal stopping problems. Yet another direction for research is understanding the dynamics of 'bootstrapping' procedures that solve a sequence of sampled versions of the SALP, with samples for a given SALP in the sequence drawn according to a policy produced by the previous SALP in
the sequence. 
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A. Proofs for Section [472 



LemmaQ] For any r G M. K and 9 > 0: 
(i) £(r, 9) is a finite-valued, decreasing, piecewise linear, convex function of 9. 
(ii) 

1 + a 

i(r,0) < - ||J*-$r||oo. 

1 — a 

(iii) The right partial derivative of £(r, 9) with respect to 9 satisfies 

° + £(r,0) = -((l-a) W 
\ xen(r) 



-i 



89+ 
where 



Q(r) = argmax Qr(x) — T<&r(x) 



Proof. (i) Given any $r$, clearly $\gamma = \|\Phi r - T\Phi r\|_\infty$, $s = 0$ is a feasible point for the LP defining $\ell(r, \theta)$, so $\ell(r, \theta)$ is well-defined. 
To see that the LP is bounded, suppose $(s, \gamma)$ is feasible. Then, for any $x \in \mathcal{X}$ with $\pi_{\mu^*,\nu}(x) > 0$, 

$$\gamma \ge \Phi r(x) - T\Phi r(x) - s(x) \ge \Phi r(x) - T\Phi r(x) - \theta / \pi_{\mu^*,\nu}(x) > -\infty.$$ 

Letting $(\gamma_1, s_1)$ and $(\gamma_2, s_2)$ represent optimal solutions for the LP with parameters $(r, \theta_1)$ and $(r, \theta_2)$ 
respectively, it is easy to see that $((\gamma_1 + \gamma_2)/2, (s_1 + s_2)/2)$ is feasible for the LP with parameters 
$(r, (\theta_1 + \theta_2)/2)$. It follows that $\ell(r, (\theta_1 + \theta_2)/2) \le (\ell(r, \theta_1) + \ell(r, \theta_2))/2$. The remaining properties are simple to 
check. 

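(ii) Let $\epsilon = \|J^* - \Phi r\|_\infty$. Then, 

$$\|T\Phi r - \Phi r\|_\infty \le \|J^* - T\Phi r\|_\infty + \|J^* - \Phi r\|_\infty \le \alpha \|J^* - \Phi r\|_\infty + \epsilon = (1 + \alpha)\epsilon.$$ 

Since $\gamma = \|T\Phi r - \Phi r\|_\infty$, $s = 0$ is feasible for the LP defining $\ell(r, \theta)$, the result follows. 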

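(iii) Fix $r \in \mathbb{R}^K$, and define 

$$\Delta = \max_{x \in \mathcal{X}} \left( \Phi r(x) - T\Phi r(x) \right) - \max_{x \in \mathcal{X} \setminus \Omega(r)} \left( \Phi r(x) - T\Phi r(x) \right) > 0.$$ 

Consider the program for $\ell(r, \delta)$. It is easy to verify that for $\delta \ge 0$ and sufficiently small, viz. 
$\delta \le \Delta \sum_{x \in \Omega(r)} \pi_{\mu^*,\nu}(x)$, $(s_\delta, \gamma_\delta)$ is an optimal solution to the program, where 

$$s_\delta(x) = \begin{cases} \dfrac{\delta}{\sum_{x' \in \Omega(r)} \pi_{\mu^*,\nu}(x')} & \text{if } x \in \Omega(r), \\ 0 & \text{otherwise,} \end{cases}$$ 

and 

$$\gamma_\delta = \gamma_0 - \frac{\delta}{\sum_{x \in \Omega(r)} \pi_{\mu^*,\nu}(x)},$$ 

so that 

$$\ell(r, \delta) = \ell(r, 0) - \frac{\delta}{(1 - \alpha) \sum_{x \in \Omega(r)} \pi_{\mu^*,\nu}(x)}.$$ 

Thus, 

$$\frac{\ell(r, \delta) - \ell(r, 0)}{\delta} = -\left( (1 - \alpha) \sum_{x \in \Omega(r)} \pi_{\mu^*,\nu}(x) \right)^{-1}.$$ 

Taking a limit as $\delta \searrow 0$ yields the result. 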
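
The slope in part (iii) can be sanity-checked numerically on a toy instance. The sketch below is illustrative only: it assumes, as the factor (1 - alpha) in the derivative suggests, that l(r, theta) is the optimal value of minimizing gamma / (1 - alpha) subject to Phi r - T Phi r <= s + gamma 1, pi' s <= theta, s >= 0. The vector `g` (standing in for Phi r - T Phi r), the strictly positive distribution `pi`, and the helper names are made up for this example. For a fixed gamma the cheapest feasible slack is s(x) = max(g(x) - gamma, 0), so the LP reduces to a one-dimensional search over gamma.

```python
# Toy check of the right derivative in part (iii).
# ASSUMPTIONS (not from the paper): g, pi, alpha are invented numbers, pi > 0,
# and ell is the optimal value of  min gamma / (1 - alpha)  subject to
# g(x) - gamma <= s(x),  sum_x pi(x) s(x) <= theta,  s >= 0.

def min_budget(g, pi, gamma):
    """Smallest pi-weighted slack that makes gamma feasible."""
    return sum(p * max(gx - gamma, 0.0) for gx, p in zip(g, pi))

def ell(g, pi, alpha, theta, iters=200):
    """Bisect on gamma; min_budget is nonincreasing in gamma."""
    hi = max(g)                            # feasible: s = 0 works
    lo = min(g) - theta / min(pi) - 1.0    # infeasible: budget exceeds theta
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if min_budget(g, pi, mid) <= theta:
            hi = mid
        else:
            lo = mid
    return hi / (1.0 - alpha)

g = [3.0, 3.0, 1.0, 0.5]      # hypothetical Bellman errors; Omega = {0, 1}
pi = [0.1, 0.2, 0.3, 0.4]     # hypothetical state distribution
alpha = 0.9

omega_mass = sum(p for gx, p in zip(g, pi) if gx == max(g))
predicted_slope = -1.0 / ((1.0 - alpha) * omega_mass)

delta = 0.3  # within the range delta <= Delta * omega_mass = 2 * 0.3
observed_slope = (ell(g, pi, alpha, delta) - ell(g, pi, alpha, 0.0)) / delta
print(predicted_slope, observed_slope)
```

Because l(r, .) is linear on the interval where the formula for (s_delta, gamma_delta) applies, the finite difference matches the predicted slope up to rounding rather than only in the limit.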

Lemma 2. Suppose that the vectors $J \in \mathbb{R}^{\mathcal{X}}$ and $s \in \mathbb{R}^{\mathcal{X}}$ satisfy 

$$J \le T_{\mu^*} J + s.$$ 

Then, 

$$J \le J^* + \Delta^* s,$$ 

where 

$$\Delta^* \triangleq \sum_{k=0}^{\infty} (\alpha P_{\mu^*})^k = (I - \alpha P_{\mu^*})^{-1},$$ 

and $P_{\mu^*}$ is the transition probability matrix corresponding to an optimal policy. 
In particular, if $(r, s)$ is feasible for the SALP, then 

$$\Phi r \le J^* + \Delta^* s.$$ 

Proof. Note that $T_{\mu^*}$, the Bellman operator corresponding to the optimal policy $\mu^*$, is monotonic and 
is a contraction. Then, repeatedly applying $T_{\mu^*}$ to the inequality $J \le T_{\mu^*} J + s$ and using the fact that 
$T_{\mu^*}^k J \to J^*$, we obtain 

$$J \le J^* + \sum_{k=0}^{\infty} (\alpha P_{\mu^*})^k s = J^* + \Delta^* s.$$ 
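
The bound of Lemma 2 can be illustrated on a small example. The sketch below assumes a three-state chain with a single action per state, so that its only policy is trivially optimal and the optimal-policy Bellman operator is J -> cost + alpha P J. The transition matrix, costs, and the vector `J` are invented for illustration, and the slack is chosen as the positive part of J - TJ so that the hypothesis of the lemma holds by construction.

```python
# Toy check of  J <= J* + Delta* s  whenever  J <= T J + s.
# ASSUMPTIONS (not from the paper): a single-action chain with made-up
# P, cost and J; the operator is T J = cost + alpha * P J.

def matvec(P, v):
    return [sum(p * x for p, x in zip(row, v)) for row in P]

def discounted_sum(P, alpha, b, iters=2000):
    """Approximate sum_k (alpha P)^k b via the iteration v <- b + alpha P v."""
    v = [0.0] * len(b)
    for _ in range(iters):
        v = [bi + alpha * pv for bi, pv in zip(b, matvec(P, v))]
    return v

alpha = 0.9
P = [[0.5, 0.5, 0.0],
     [0.1, 0.6, 0.3],
     [0.0, 0.2, 0.8]]
cost = [1.0, 2.0, 0.5]
J = [30.0, 5.0, 12.0]          # arbitrary candidate cost-to-go vector

TJ = [c + alpha * pj for c, pj in zip(cost, matvec(P, J))]
s = [max(j - tj, 0.0) for j, tj in zip(J, TJ)]    # ensures J <= TJ + s

J_star = discounted_sum(P, alpha, cost)           # J* = (I - alpha P)^{-1} cost
delta_star_s = discounted_sum(P, alpha, s)        # Delta* s
gaps = [js + ds - j for js, ds, j in zip(J_star, delta_star_s, J)]
print(gaps)   # every entry should be nonnegative, up to rounding
```

The truncated series stands in for the matrix inverse; with alpha = 0.9 the truncation error after 2000 terms is far below floating-point precision.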



B. Proof of Lemma 3 

We begin with the following definition: consider a family $\mathcal{F}$ of functions from a set $\mathcal{S}$ to $\{0, 1\}$. Define the 
Vapnik-Chervonenkis (VC) dimension $\dim_{VC}(\mathcal{F})$ to be the cardinality $d$ of the largest set $\{x_1, x_2, \ldots, x_d\} \subset \mathcal{S}$ 
satisfying: 

$$\forall\, e \in \{0, 1\}^d, \ \exists\, f \in \mathcal{F} \text{ such that } \forall\, i, \ f(x_i) = 1 \text{ iff } e_i = 1.$$ 

Now, let $\mathcal{F}$ be some set of real-valued functions mapping $\mathcal{S}$ to $[0, B]$. The pseudo-dimension $\dim_P(\mathcal{F})$ 
is the following generalization of VC dimension: for each function $f \in \mathcal{F}$ and scalar $c \in \mathbb{R}$, define a function 
$g : \mathcal{S} \times \mathbb{R} \to \{0, 1\}$ according to: 

$$g(x, c) = \mathbb{I}\{f(x) - c \ge 0\}.$$ 

Let $\mathcal{G}$ denote the set of all such functions. Then, we define $\dim_P(\mathcal{F}) = \dim_{VC}(\mathcal{G})$. 

In order to prove Lemma 3, define $\mathcal{F}$ to be the set of functions $f : \mathbb{R}^K \times \mathbb{R} \to [0, B]$, where, for all 
$y \in \mathbb{R}^K$ and $z \in \mathbb{R}$, 

$$f(y, z) = \zeta\left(r^\top y + z\right).$$ 

Here, $\zeta(t) = \max(\min(t, B), 0)$, and $r \in \mathbb{R}^K$ is a vector that parameterizes $f$. We will show that 
$\dim_P(\mathcal{F}) \le K + 2$. 
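
To make the two definitions concrete, the following sketch computes VC dimension and pseudo-dimension by brute force on finite grids. It is an illustration under stated assumptions rather than part of the proof: the function names, the threshold class, the grid of parameters r, and the grids of points (y, z) and levels c are all hypothetical choices. Restricting the parameters and the domain to finite grids can only make shattering harder, so the value found for the clipped linear class with K = 1 is a lower bound on its true pseudo-dimension and must respect the bound K + 2.

```python
from itertools import combinations

def is_shattered(points, functions):
    """True if the functions realize all 2^d labelings of the d points."""
    labelings = {tuple(f(x) for x in points) for f in functions}
    return len(labelings) == 2 ** len(points)

def vc_dimension(domain, functions):
    """Largest d such that some d-subset of the finite domain is shattered."""
    best = 0
    for d in range(1, len(domain) + 1):
        if any(is_shattered(pts, functions) for pts in combinations(domain, d)):
            best = d
        else:
            break  # subsets of shattered sets are shattered, so we can stop
    return best

def pseudo_dimension(domain, levels, functions):
    """VC dimension of the indicators g(x, c) = 1{f(x) - c >= 0} on a grid."""
    lifted_domain = [(x, c) for x in domain for c in levels]
    indicators = [lambda xc, f=f: int(f(xc[0]) - xc[1] >= 0) for f in functions]
    return vc_dimension(lifted_domain, indicators)

# Binary example (hypothetical): threshold functions on {0, ..., 4}.
thresholds = [lambda x, t=t: int(x >= t) for t in range(6)]
vc_thresholds = vc_dimension(list(range(5)), thresholds)

# Real-valued example with K = 1: f(y, z) = zeta(r * y + z), clipped to [0, B].
B, K = 1.0, 1
zeta = lambda t: max(min(t, B), 0.0)
r_grid = [k / 4 for k in range(-8, 9)]            # assumed parameter grid
clipped = [lambda yz, r=r: zeta(r * yz[0] + yz[1]) for r in r_grid]
points = [(y, z) for y in (-1.0, 0.0, 1.0) for z in (0.0, 0.5)]
pdim = pseudo_dimension(points, [0.25, 0.5, 0.75], clipped)
print(vc_thresholds, pdim)
```

The levels c are taken inside (0, B] because, as the proof of Lemma 5 observes, no point with a level outside that interval can belong to a shattered set.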

We will use the following standard result from convex geometry: 

Lemma 4 (Radon's Lemma). A set $A \subset \mathbb{R}^m$ of $m + 2$ points can be partitioned into two disjoint sets $A_1$ and 
$A_2$, such that the convex hulls of $A_1$ and $A_2$ intersect. 
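
For intuition, here is a small instance of Radon's lemma with m = 2; the points are chosen purely for illustration. The m + 2 = 4 corners of the unit square split into the two diagonals, whose convex hulls (the diagonal segments) meet at the center of the square.

```latex
A = \{(0,0),\,(1,1),\,(1,0),\,(0,1)\} \subset \mathbb{R}^2, \qquad
A_1 = \{(0,0),\,(1,1)\}, \quad A_2 = \{(1,0),\,(0,1)\},
\\
\tfrac{1}{2}(0,0) + \tfrac{1}{2}(1,1) \;=\; \left(\tfrac{1}{2}, \tfrac{1}{2}\right) \;=\; \tfrac{1}{2}(1,0) + \tfrac{1}{2}(0,1).
```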

Lemma 5. $\dim_P(\mathcal{F}) \le K + 2$. 

Proof. Assume, for the sake of contradiction, that $\dim_P(\mathcal{F}) > K + 2$. It must be that there exists a 'shattered' 
set 

$$\left\{ (y^{(1)}, z^{(1)}, c^{(1)}), (y^{(2)}, z^{(2)}, c^{(2)}), \ldots, (y^{(K+3)}, z^{(K+3)}, c^{(K+3)}) \right\} \subset \mathbb{R}^K \times \mathbb{R} \times \mathbb{R},$$ 

such that, for all $e \in \{0, 1\}^{K+3}$, there exists a vector $r_e \in \mathbb{R}^K$ with 

$$\zeta\left(r_e^\top y^{(i)} + z^{(i)}\right) \ge c^{(i)} \text{ iff } e_i = 1, \quad \forall\, 1 \le i \le K + 3.$$ 

Observe that we must have $c^{(i)} \in (0, B]$ for all $i$, since if $c^{(i)} \le 0$ or $c^{(i)} > B$, then no such shattered set 
can be demonstrated. But if $c^{(i)} \in (0, B]$, for all $r \in \mathbb{R}^K$, 

$$\zeta\left(r^\top y^{(i)} + z^{(i)}\right) \ge c^{(i)} \implies r^\top y^{(i)} \ge c^{(i)} - z^{(i)},$$ 

and 

$$\zeta\left(r^\top y^{(i)} + z^{(i)}\right) < c^{(i)} \implies r^\top y^{(i)} < c^{(i)} - z^{(i)}.$$ 

For each $1 \le i \le K + 3$, define $x^{(i)} \in \mathbb{R}^{K+1}$ component-wise according to 

$$x^{(i)}_j = \begin{cases} y^{(i)}_j & \text{if } j < K + 1, \\ c^{(i)} - z^{(i)} & \text{if } j = K + 1. \end{cases}$$ 

Let $A = \{x^{(1)}, x^{(2)}, \ldots, x^{(K+3)}\} \subset \mathbb{R}^{K+1}$, and let $A_1$ and $A_2$ be subsets of $A$ satisfying the conditions of 
Radon's lemma. Define a vector $\tilde{e} \in \{0, 1\}^{K+3}$ component-wise according to 

$$\tilde{e}_i = \mathbb{I}\{x^{(i)} \in A_1\}.$$ 

Define the vector $\tilde{r} = r_{\tilde{e}}$. Then, we have 

$$\sum_{j=1}^{K} \tilde{r}_j x_j \ge x_{K+1}, \quad \forall\, x \in A_1,$$ 

$$\sum_{j=1}^{K} \tilde{r}_j x_j < x_{K+1}, \quad \forall\, x \in A_2.$$ 

Now, let $\bar{x} \in \mathbb{R}^{K+1}$ be a point contained in both the convex hull of $A_1$ and the convex hull of $A_2$. Such 
a point must exist by Radon's lemma. By virtue of being contained in the convex hull of $A_1$, we must have 

$$\sum_{j=1}^{K} \tilde{r}_j \bar{x}_j \ge \bar{x}_{K+1}.$$ 

Yet, by virtue of being contained in the convex hull of $A_2$, we must have 

$$\sum_{j=1}^{K} \tilde{r}_j \bar{x}_j < \bar{x}_{K+1},$$ 

which is impossible. 



With the above pseudo-dimension estimate, Lemma 3 follows immediately from Corollary 2 of 
Haussler (1992, Section 4). 